Setup¶
Pandas is close to a one-stop shop for data analysis in Python, so most plotting here is done with the df.plot() syntax. Matplotlib must still be installed and imported, since that syntax depends on it. I would also recommend installing Seaborn for more interesting plot types and statistical features. Plus, it has a nice native style.
Dependencies¶
import pandas as pd
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
import seaborn as sns
Display¶
There are a few options for visualizing in a Jupyter Notebook: present static charts inline, embed them as interactive elements using the notebook setting, or open the chart in a new window using a specified backend (e.g. GTK3Agg for raster graphics, GTK3Cairo for vector graphics). See the ipython source (somewhat out of date) and the matplotlib source. There is also info on how to run matplotlib in the backend of a web server.
%matplotlib inline
#%matplotlib notebook
#%matplotlib GTK3Cairo
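Outside a notebook, for instance in a script or behind a web server, the magics above are not available and the backend is chosen in code. The following is a minimal sketch, assuming the non-interactive Agg backend and an in-memory buffer as the output target; the plotted values are placeholders.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive raster backend, no window is opened
from matplotlib import pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])  # placeholder data

# Render to an in-memory PNG instead of displaying the figure
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)  # release the figure once it has been rendered
```

The resulting bytes can be written to disk or returned from a request handler.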
Style¶
Call plt.style.available to show the options, then set the style of choice. You can also customize the style sheet manually using rcParams.
See more on styling here: https://matplotlib.org/users/customizing.html
# on Matplotlib >= 3.6 the Seaborn styles are renamed, e.g. 'seaborn-v0_8-white'
plt.style.use('seaborn-white')
mpl.rc('figure')
mpl.rc('savefig', transparent=True, dpi=700, bbox='tight', pad_inches=.05, format='png')
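The mpl.rc calls above change settings for the whole session. If only one chart needs different settings, a temporary override may be more convenient. This is a small sketch using mpl.rc_context; the specific values (300 dpi, transparent background) are illustrative assumptions, not recommendations from this notebook.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so the sketch runs anywhere
from matplotlib import pyplot as plt

styles = sorted(plt.style.available)  # names accepted by plt.style.use()
default_dpi = mpl.rcParams["savefig.dpi"]

# Overrides apply only inside the with-block
with mpl.rc_context({"savefig.dpi": 300, "savefig.transparent": True}):
    inside_dpi = mpl.rcParams["savefig.dpi"]

# The previous value is back once the block exits
restored_dpi = mpl.rcParams["savefig.dpi"]
```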
For colors, there are plenty of presets in Matplotlib and Seaborn, but you can always construct your own custom-length arrays of hues. A good resource, where you will find the widely used acronyms, is colorbrewer.
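One way to build a custom-length array of hues is to sample a colormap at evenly spaced points. The sketch below assumes the ColorBrewer-derived RdYlBu colormap and six series; both choices are hypothetical.

```python
import numpy as np
import matplotlib as mpl
mpl.use("Agg")  # headless backend so the sketch runs anywhere
from matplotlib import pyplot as plt

n_colors = 6  # hypothetical number of series to color
cmap = plt.get_cmap("RdYlBu")  # ColorBrewer acronym: Red-Yellow-Blue

# Sample the colormap at evenly spaced positions and convert to hex strings
palette = [mpl.colors.to_hex(cmap(v)) for v in np.linspace(0, 1, n_colors)]
```

The list can then be passed to plotting calls that accept a sequence of colors, such as the color argument of df.plot().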
Data¶
# generate dummy data for 4 columns, and 5 full years, using month start date
# pd.datetime was removed in pandas 2.0, so the start date is built from pd.Timestamp
df = pd.DataFrame(data=np.random.rand(12*5,4),
                  index=pd.date_range(start=pd.Timestamp(year=pd.Timestamp.now().year-5, month=1, day=1), periods=12*5, freq='MS'),
                  columns=['A', 'B', 'C', 'D'])
# generate random categories and assign as categorical column
#categories = np.random.randint(0, 5, size=60)
#df = df.assign(category=pd.Series(categories, dtype='category').values)
df.info()
df.describe()
df.head()
df.tail()
Transform¶
Group data over time periods. Refer to Pandas Offset Aliases
# 1 year periods by year end (pd.TimeGrouper was removed; pandas >= 2.2 spells this alias 'YE')
df1 = df.groupby(pd.Grouper(freq='1A')).sum()
df1.shape
# 1 quarter periods by quarter end (pd.TimeGrouper was removed; pandas >= 2.2 spells this alias 'QE')
df2 = df.groupby(pd.Grouper(freq='1Q')).sum()
df2.shape
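An alternative that does not depend on the year-end and quarter-end offset aliases is to group on calendar periods. This is a sketch of that different technique, not the approach used above: it labels each group with a Period instead of an end-of-period timestamp. The demo frame is a hypothetical stand-in for df, filled with ones so the sums are easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: 5 years of month-start rows, every value 1.0
index = pd.date_range(start="2020-01-01", periods=12 * 5, freq="MS")
demo = pd.DataFrame(np.ones((len(index), 4)), index=index, columns=list("ABCD"))

# Group rows by the calendar year / quarter each timestamp falls in
yearly = demo.groupby(demo.index.to_period("Y")).sum()
quarterly = demo.groupby(demo.index.to_period("Q")).sum()
```

With monthly rows of ones, every yearly sum is 12 and every quarterly sum is 3, which makes a quick sanity check on the grouping.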
Visualizations¶
There are many libraries for visualizing data in Python. Most rely on Matplotlib; however, there are some convenience advantages in not using Matplotlib directly.
For example, Matplotlib requires more code to make a simple chart that the Pandas and Seaborn APIs can produce in a single line.
Below are some basics of Pandas, Matplotlib and Seaborn, showing where each library shines.
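To illustrate the difference in verbosity, here is a rough side-by-side sketch of the same line chart drawn through the Pandas API and through Matplotlib directly. The demo frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
from matplotlib import pyplot as plt

demo = pd.DataFrame(np.random.rand(10, 2), columns=["A", "B"])  # hypothetical data

# Pandas: one call draws every column and adds a legend
ax_pandas = demo.plot()

# Matplotlib: the same chart, built step by step
fig, ax_mpl = plt.subplots()
for column in demo.columns:
    ax_mpl.plot(demo.index, demo[column], label=column)
ax_mpl.legend()
```

Both routes end with a Matplotlib Axes object, so the one-line version can still be customized afterwards with the usual ax.set(...) calls.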
Pandas Plot Basics¶
BoxPlot¶
df.boxplot()
Matplotlib Basics¶
Multi Line Chart¶
fig, ax = plt.subplots(figsize=(12,8))
df1.plot(ax=ax, colormap='Spectral')
ax.set(ylabel='Categories', xlabel='Time', title='Category Volume Over 1 Year Periods')
# annotate data labels onto series lines
for series in df1.columns:
    for x,y in zip(df1.index, df1[series]):
        ax.annotate(str(round(y,2)), xy=(x,y+(.01*df1.values.max())))
fig.tight_layout(pad=2)
fig.savefig('img/category_volume_over_time_multi_line_chart.png')
Stacked Bar Chart¶
fig, ax = plt.subplots(figsize=(12,8))
df2.plot(kind='bar', stacked=True, ax=ax, colormap='Spectral')
ax.set(ylabel='Volume', xlabel='Time', title='Category Volume Over 1 Quarter Periods')
# auto format xaxis labels as date
fig.autofmt_xdate()
# custom format xaxis date labels
ax.xaxis.set_major_formatter(plt.FixedFormatter(df2.index.to_series().dt.strftime('%b %Y')))
# annotate data labels onto vertical bars
for bar,(col,ix) in zip(ax.patches, pd.MultiIndex.from_product([df2.columns,df2.index])):
    label = '{:,.2f}'.format(df2.loc[ix,col])
    # height of the bars already stacked beneath this one (an empty slice sums to 0)
    stack = df2.iloc[df2.index.get_loc(ix),:df2.columns.get_loc(col)].sum()
    ax.text(s=label, x=bar.get_x()+(bar.get_width()/2), y=stack+bar.get_height()-(.05*bar.get_height()),
            ha='center', va='top', fontdict={'fontsize':10, 'color':'white'})
fig.tight_layout(pad=2)
fig.savefig('img/category_volume_over_time_stacked_bar_chart.png')
Advanced Matplotlib¶
Custom Bar Chart¶
fig, ax = plt.subplots(figsize=(12,8))
series = len(df1.columns)
groups = len(df1.index)
bars = series * groups
width = .90 / series
bar_offset = series * width / 2
# bars
for i,(col,values) in enumerate(df1.items()):  # iteritems() was removed in pandas 2.0
    s = df1.columns.get_loc(col) + 1
    rect = ax.bar(np.arange(groups)+(s*width)-bar_offset, list(values), width=width,
                  color=mpl.colors.rgb2hex(plt.get_cmap('Spectral',series)(i)[:3]))
# ticks
ax.set_xticks(np.arange(groups)+(width/2))  # center of each group of bars
ax.set_xticklabels(df1.index.strftime('%b %Y'))
ax.tick_params(axis='both', which='both', direction='out', length=6, width=2,
               left=True, right=False, top=False, bottom=True,
               labelsize=12)
# auto format xaxis labels as date
fig.autofmt_xdate()
ax.set(ylabel='Volume', xlabel='Time', title='Category Volume per Year End')
# labels
for bar,(col,ix) in zip(ax.patches, pd.MultiIndex.from_product([df1.columns,df1.index])):
    label = '{:,.2f}'.format(df1.loc[ix,col])
    ax.text(s=label, x=bar.get_x()+(bar.get_width()/2), y=bar.get_height()+(.05*df1.values.max()),
            ha='center', va='bottom', fontdict={'fontsize':10})
# legend
handles, labels = ax.containers, list(df1.columns)
lgd = ax.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=4, fontsize=12, frameon=False)
# centered on top
#lgd = ax.legend(handles, labels, loc='lower center', bbox_to_anchor=(0,1.02,1,0.2), ncol=4, fontsize=12, frameon=False)
# remove right and top axes borders
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
fig = plt.gcf()
fig.tight_layout()
fig.subplots_adjust(top=.9, bottom=.2)
fig.savefig('img/category_volume_per_year_end_custom.png')
Seaborn¶
Seaborn Histogram¶
df.A.value_counts(bins=10)
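# distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable histogram plus density curve
ax = sns.histplot(df.A, bins=10, kde=True, stat='density')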
Seaborn Scatterplot¶
ax = sns.regplot(x="A", y="B", data=df)
plot = sns.jointplot(x='A', y='B', data=df, kind='scatter')
plot = sns.pairplot(df)
Seaborn Heatmap¶
df.corr()
ax = sns.heatmap(df.corr(), square=True, cmap='RdYlBu')
Resources¶
Chris Moffitt at Practical Business Python has a great tutorial and helpful infographic on matplotlib