How To XML to Pandas¶
This notebook demonstrates how to process XML data in Python using pandas. Specifically, we use the xmltodict library to parse XML into a dictionary, then read what we want into pandas.
We then show how to clean up various data types and nested columns into a format that can be used in analysis.
NOTE: The example data provided is very basic and contains little numerical data for analysis. However, the methodology can be carried over to a more quantitative analysis (e.g. statistics over reporting periods).
Get XML¶
For fun, let's use the RSS feed of this site.
import requests
response = requests.get('https://knanne.github.io/feed.xml')
response.ok
xmltodict¶
This is a Python library available on PyPI. Install it with pip install xmltodict
import xmltodict
import json
Parse the XML¶
%%time
d = xmltodict.parse(response.content)
Preview the JSON¶
print(json.dumps(d, indent=2)[:750])
print('.\n'*3)
Convert to Pandas DataFrame¶
Now, this step still requires some manual inspection.
Since the JSON is still a series of nested dictionaries, we need to identify the recurring "entries" we are looking for (e.g. documents, posts, entries, etc.).
If you want to extract metadata from the root above your entries, I would suggest creating a separate DataFrame for it and merging it with your "entries" DataFrame later.
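As a hedged sketch of that idea (the 'feed'/'entry' keys match this RSS example, but the metadata fields and the feed_ prefix are made up for illustration):

```python
import pandas as pd

# Toy parsed structure standing in for xmltodict's output:
# root-level metadata plus a recurring list of entries
d = {
    'feed': {
        'title': 'Example Feed',    # assumed root metadata field
        'updated': '2018-01-01',    # assumed root metadata field
        'entry': [
            {'id': 'post-1', 'title': 'First'},
            {'id': 'post-2', 'title': 'Second'},
        ],
    }
}

# One DataFrame for the recurring entries...
entries = pd.DataFrame.from_dict(d['feed']['entry'])

# ...and the root metadata broadcast onto every entry,
# prefixed to avoid clobbering entry-level columns like 'title'
metadata = {f'feed_{k}': v for k, v in d['feed'].items() if k != 'entry'}
entries = entries.assign(**metadata)
```

Here the metadata is simply broadcast onto every entry row; for larger feeds you could instead keep the two DataFrames separate and merge on a shared key.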
import pandas as pd
df = pd.DataFrame.from_dict(d['feed']['entry'])
df.shape
df.columns
df.head()
Cleaning¶
Okay, you may notice how messy the above DataFrame looks. It is actually quite detailed, which is very nice; however, we will need to transform some columns before we can apply any analysis to them.
Column with Dictionary (single key)¶
Consider the "author" column. Each value is simply a dictionary with a single key-value pair. We can access the value by applying .get("key") to the column.
df['author'].apply(lambda x: x.get('name'))
More efficiently, we can convert each dictionary to a pd.Series, which takes care of extracting the value and naming the new column after the key, and then simply join the result back to our original DataFrame.
df.author.apply(pd.Series)
df = df.join(df.author.apply(pd.Series), how='left')
Preview the last column to verify that worked.
df.shape
df.head()
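The pd.Series trick above can be seen in isolation on a toy dict-valued column (made-up data):

```python
import pandas as pd

# Toy DataFrame with a dict-valued column (made-up data)
df_toy = pd.DataFrame({'author': [{'name': 'alice'}, {'name': 'bob'}]})

# apply(pd.Series) expands each dict into columns named after its keys
expanded = df_toy.join(df_toy['author'].apply(pd.Series), how='left')
```

The expanded DataFrame keeps the original "author" column and gains a "name" column taken from the dictionary key.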
Column with Dictionary (multiple keys)¶
Consider the "title" column, which contains a single dictionary with multiple keys. Applying the same method as above works well and expands each key into its own column with the respective values.
df.title.apply(pd.Series)
df = df.join(df.title.apply(pd.Series), how='left')
df.shape
df.head()
Clean Up Columns¶
Of course, at any time, including when joining, you can drop and rename columns as needed for clarity.
df = df.drop(['author','title'], axis=1)\
.rename(columns={'name':'author', '#text':'title'})
Data Types¶
Similarly, data types can be converted during the above processing when needed. For example, on dates:
df.loc[:,'published'] = pd.to_datetime(df.published)
Column with List of Dictionaries¶
Now consider the category column, which is a list of dictionaries containing all the categories associated with a post. However, if there is only one category the value is a single dictionary, and if there are none it is null (np.nan). So what happens if we simply try to explode this column into a pd.Series?
df.category.apply(pd.Series)
The above shows something inconsistent happening because of the mixed data types. So instead, let's wrap the single dictionaries in [] to convert them to lists and standardize the data.
df.category.apply(lambda x: pd.Series(x) if isinstance(x,list) else pd.Series([x]))
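The same standardization can be illustrated on a minimal, self-contained toy column (the category values below are made up, mimicking the three cases: list of dicts, single dict, and missing):

```python
import numpy as np
import pandas as pd

# Toy column mimicking the three cases seen above:
# a list of dicts, a single dict, and a missing value
s = pd.Series([
    [{'@term': 'python'}, {'@term': 'pandas'}],
    {'@term': 'xml'},
    np.nan,
])

# Wrap non-lists in [] so every row expands the same way
normalized = s.apply(lambda x: pd.Series(x) if isinstance(x, list) else pd.Series([x]))
```

Every row now expands to the same column layout, with NaN padding where a row has fewer categories.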
Okay, not bad. Now let's stack this data, rename the column, extract the actual value, and add the term index as a second column.
exploded_column = df.category.apply(lambda x: pd.Series(x) if isinstance(x,list) else pd.Series([x]))\
.stack()\
.rename('term')\
.apply(lambda x: x['@term'])\
.to_frame()\
.reset_index(level=1)\
.rename(columns={'level_1':'term_index'})
exploded_column
Now join this back to our original DataFrame to add the categories for each post. (FYI: this join will of course multiply your rows, so beware with large datasets.)
df = df.join(exploded_column, how='left')
df.shape
df.head()
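The row multiplication warned about above can be demonstrated with a toy index-aligned join (made-up data):

```python
import pandas as pd

# Two "posts" with a default integer index
posts = pd.DataFrame({'title': ['a', 'b']})

# Stacked "terms" whose (repeated) index points back at the posts
terms = pd.DataFrame({'term': ['x', 'y', 'z']}, index=[0, 0, 1])

# A left join on the index repeats each post once per matching term
joined = posts.join(terms, how='left')
```

Post "a" matches two terms, so its row appears twice in the result: two post rows become three joined rows.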
Analyze¶
Now that the data has been processed into a familiar shape, we can perform typical analysis.
df.groupby([df.published.dt.year, df.published.dt.month]).agg({'title':'count'})
df.groupby('title').agg({'term': 'nunique'})