pandas_explode_column

Pandas Explode Column

This notebook demonstrates how to explode a column with nested values, either in CSV format or a dictionary (e.g. JSON).

NOTE: Method 3 of the CSV explosdion is the most efficient, and skip down to the Explode Dict Column for a super efficient way of exploding a dictionary of values in a Pandas DataFrame.

In [1]:
import pandas as pd

Data

In [2]:
df = pd.DataFrame({'csv' : ['value1',
                            'value1,value2',
                            'value1,value2,value3,value4,value5,value6',
                            'value1',
                            'value1,value2,value3,value4'],
                   'json' : [{'key1':'value1', 'key2':'value2.1', 'key3':'value3.1', 'key4':'value4.1'},
                             {'key1':'value2', 'key2':'value2.2', 'key3':'value3.2', 'key4':'value4.2'},
                             {'key1':'value3', 'key2':'value2.3', 'key3':'value3.3'},
                             {'key1':'value4', 'key2':'value2.4'},
                             {'key1':'value5', 'key2':'value2.5'}]
                  })
In [3]:
df.index.names = ['id']
In [4]:
df
Out[4]:
csv json
id
0 value1 {'key1': 'value1', 'key2': 'value2.1', 'key3':...
1 value1,value2 {'key1': 'value2', 'key2': 'value2.2', 'key3':...
2 value1,value2,value3,value4,value5,value6 {'key1': 'value3', 'key2': 'value2.3', 'key3':...
3 value1 {'key1': 'value4', 'key2': 'value2.4'}
4 value1,value2,value3,value4 {'key1': 'value5', 'key2': 'value2.5'}

Explode CSV Column

Method 1

In [5]:
data = {i:row.csv.split(',') for i,row in df.iterrows()}
In [6]:
df_exploded = pd.DataFrame.from_dict(data, orient='index').stack().to_frame()
In [7]:
df_exploded.columns = ['value']
df_exploded.index.names = ['id', 'csv_sequence']
In [8]:
df_exploded
Out[8]:
value
id csv_sequence
0 0 value1
1 0 value1
1 value2
2 0 value1
1 value2
2 value3
3 value4
4 value5
5 value6
3 0 value1
4 0 value1
1 value2
2 value3
3 value4

Method 2

In [9]:
df_exploded = df['csv'].str.split(pat=',', expand=True).stack().to_frame()
In [10]:
df_exploded.columns = ['value']
df_exploded.index.names = ['id', 'csv_sequence']
In [11]:
df_exploded
Out[11]:
value
id csv_sequence
0 0 value1
1 0 value1
1 value2
2 0 value1
1 value2
2 value3
3 value4
4 value5
5 value6
3 0 value1
4 0 value1
1 value2
2 value3
3 value4

Method 3

In [12]:
df_exploded = df.csv.apply(lambda x: pd.Series(x.split(','))).stack().to_frame()
In [13]:
df_exploded.columns = ['value']
df_exploded.index.names = ['id', 'csv_sequence']
In [14]:
df_exploded
Out[14]:
value
id csv_sequence
0 0 value1
1 0 value1
1 value2
2 0 value1
1 value2
2 value3
3 value4
4 value5
5 value6
3 0 value1
4 0 value1
1 value2
2 value3
3 value4

Join

In [15]:
df.join(df_exploded, how='left').reset_index(level='csv_sequence')
Out[15]:
csv_sequence csv json value
id
0 0 value1 {'key1': 'value1', 'key2': 'value2.1', 'key3':... value1
1 0 value1,value2 {'key1': 'value2', 'key2': 'value2.2', 'key3':... value1
1 1 value1,value2 {'key1': 'value2', 'key2': 'value2.2', 'key3':... value2
2 0 value1,value2,value3,value4,value5,value6 {'key1': 'value3', 'key2': 'value2.3', 'key3':... value1
2 1 value1,value2,value3,value4,value5,value6 {'key1': 'value3', 'key2': 'value2.3', 'key3':... value2
2 2 value1,value2,value3,value4,value5,value6 {'key1': 'value3', 'key2': 'value2.3', 'key3':... value3
2 3 value1,value2,value3,value4,value5,value6 {'key1': 'value3', 'key2': 'value2.3', 'key3':... value4
2 4 value1,value2,value3,value4,value5,value6 {'key1': 'value3', 'key2': 'value2.3', 'key3':... value5
2 5 value1,value2,value3,value4,value5,value6 {'key1': 'value3', 'key2': 'value2.3', 'key3':... value6
3 0 value1 {'key1': 'value4', 'key2': 'value2.4'} value1
4 0 value1,value2,value3,value4 {'key1': 'value5', 'key2': 'value2.5'} value1
4 1 value1,value2,value3,value4 {'key1': 'value5', 'key2': 'value2.5'} value2
4 2 value1,value2,value3,value4 {'key1': 'value5', 'key2': 'value2.5'} value3
4 3 value1,value2,value3,value4 {'key1': 'value5', 'key2': 'value2.5'} value4

Explode Dict Column

In [16]:
df
Out[16]:
csv json
id
0 value1 {'key1': 'value1', 'key2': 'value2.1', 'key3':...
1 value1,value2 {'key1': 'value2', 'key2': 'value2.2', 'key3':...
2 value1,value2,value3,value4,value5,value6 {'key1': 'value3', 'key2': 'value2.3', 'key3':...
3 value1 {'key1': 'value4', 'key2': 'value2.4'}
4 value1,value2,value3,value4 {'key1': 'value5', 'key2': 'value2.5'}
In [17]:
df.json.apply(pd.Series)
Out[17]:
key1 key2 key3 key4
id
0 value1 value2.1 value3.1 value4.1
1 value2 value2.2 value3.2 value4.2
2 value3 value2.3 value3.3 NaN
3 value4 value2.4 NaN NaN
4 value5 value2.5 NaN NaN
In [18]:
df.join(df.json.apply(pd.Series), how='left')
Out[18]:
csv json key1 key2 key3 key4
id
0 value1 {'key1': 'value1', 'key2': 'value2.1', 'key3':... value1 value2.1 value3.1 value4.1
1 value1,value2 {'key1': 'value2', 'key2': 'value2.2', 'key3':... value2 value2.2 value3.2 value4.2
2 value1,value2,value3,value4,value5,value6 {'key1': 'value3', 'key2': 'value2.3', 'key3':... value3 value2.3 value3.3 NaN
3 value1 {'key1': 'value4', 'key2': 'value2.4'} value4 value2.4 NaN NaN
4 value1,value2,value3,value4 {'key1': 'value5', 'key2': 'value2.5'} value5 value2.5 NaN NaN