Pandas MultiIndex¶
The below examples are my summarized notes based on the Pandas docs on advanced indexing. The documentation on Pandas is probably the best in the whole Python community. If you want to use the full power of Pandas yourself, start reading!
import pandas as pd
import numpy as np
Data¶
categories = ['One', 'Two', 'Three', 'Four']
years = pd.date_range(end=pd.datetime.now(), periods=4, freq='A').year
idx = pd.MultiIndex.from_product([categories, years], names=['category','year'])
df = pd.DataFrame(data=np.random.rand(16,8),
index=idx,
columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])
df
MultiIndex Columns¶
I recomend creating you multi-header as a list of tuples, then using pd.MultiIndex.from_tuples
, set it as the DataFrame columns.
columns = [('Group1' if i < 4 else 'Group2',col) for i,col in enumerate(df.columns)]
idx = pd.MultiIndex.from_tuples(columns, names=['group','subgroup'])
df.columns = idx
df
Slicing¶
Pandas already has great documenation on using slicers. Below if my simplified TLDR version.
To allow slicing, you must first sort the index.
For MultiIndex-ed objects to be indexed & sliced effectively, they need to be sorted. As with any index, you can use
sort_index
. - Pandas docs
df = df.sort_index()
For the most convenient and logical slicing, use the builtin pd.IndexSlice
tool
idx = pd.IndexSlice
Example slice on all levels avialble.
df.loc[idx[['One', 'Two'],[2014,2015]],idx[['Group1', 'Group2'],['C','D','E']]]
Slice on partial indices, and use :
for accessing all others. Below is warning from the docs.
Warning You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into say the MuliIndex for the rows. - Pandas docs
df.loc[idx[:,[2016]],idx['Group2',:]]
Reindexing¶
I recommend first creating a new index from your individually altered components, then reindex all levels at one time.
new_categories = list(df.index.get_level_values('category').unique()) + ['Zero', 'Five']
new_categories
new_years = list(range(df.index.get_level_values('year').min()-2,
df.index.get_level_values('year').max()+2))
new_years
idx = pd.MultiIndex.from_product([new_categories, new_years])
df.reindex(idx, fill_value=0)