pandas_process_XML

How To: XML to Pandas

This notebook demonstrates how to process XML data in Python using pandas. Specifically, we use the xmltodict library to parse XML into a dictionary, then read what we want into pandas.

We then show how to clean up the various data types and nested columns into a format that can be used for analysis.

NOTE: The example data here is very basic and contains little numerical data to analyze. However, the methodology should carry over to more quantitative analyses (e.g. statistics over reporting periods).

Get XML

For fun, let's use the RSS feed of this site.

In [1]:
import requests
In [2]:
response = requests.get('https://knanne.github.io/feed.xml')
In [3]:
response.ok
Out[3]:
True

xmltodict

This is a Python library available on PyPI. Install it with pip install xmltodict
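Before parsing the feed, it may help to see xmltodict's conventions on a tiny inline document (the sample XML below is made up for illustration): attributes become keys prefixed with @, text content goes under #text, and repeated tags become lists.

```python
import xmltodict

# made-up sample document for illustration
xml = """
<feed>
  <entry id="1">first</entry>
  <entry id="2">second</entry>
</feed>
"""

d = xmltodict.parse(xml)

# repeated <entry> tags parse to a list of dicts;
# attributes are '@'-prefixed, text is under '#text'
first = d['feed']['entry'][0]
```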

In [4]:
import xmltodict
import json

Parse the XML

In [5]:
%%time
d = xmltodict.parse(response.content)
Wall time: 9.83 ms

Preview the JSON

In [6]:
print(json.dumps(d, indent=2)[:750])
print('.\n'*3)
{
  "feed": {
    "@xmlns": "http://www.w3.org/2005/Atom",
    "generator": {
      "@uri": "https://jekyllrb.com/",
      "@version": "3.6.2",
      "#text": "Jekyll"
    },
    "link": [
      {
        "@href": "https://knanne.github.io/feed.xml",
        "@rel": "self",
        "@type": "application/atom+xml"
      },
      {
        "@href": "https://knanne.github.io/",
        "@rel": "alternate",
        "@type": "text/html"
      }
    ],
    "updated": "2018-01-27T15:24:01+00:00",
    "id": "https://knanne.github.io/",
    "title": {
      "@type": "html",
      "#text": "knanne"
    },
    "subtitle": "Expert traveler, amateur farmer, beer / wine enthusiast - with a data science problem",
    "author": {
      "name": "Kain Nanne"
.
.
.

Convert to Pandas DataFrame

Now, this step still requires some manual work.

Since the JSON is still a series of nested dictionaries, we need to identify the recurring "entries" we are looking for (e.g. documents, posts, entries, etc.).

If you want to extract metadata from the root of your entries, I would suggest creating a separate DataFrame for it and merging your "entries" DataFrame with your "metadata" DataFrame later.
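As a sketch of that suggestion (the feed dictionary below is a minimal made-up example, not this site's actual feed), you could build a one-row metadata DataFrame from the root and merge it onto the entries:

```python
import pandas as pd

# made-up parsed feed: root-level metadata plus repeating entries
d = {'feed': {'title': 'example', 'updated': '2018-01-01',
              'entry': [{'id': 'a', 'published': '2017-12-27'},
                        {'id': 'b', 'published': '2017-11-22'}]}}

# one-row DataFrame of root metadata (everything except the entries)
meta = pd.DataFrame([{k: v for k, v in d['feed'].items() if k != 'entry'}])

# DataFrame of the recurring entries
entries = pd.DataFrame.from_dict(d['feed']['entry'])

# cross join the single metadata row onto every entry
# (the temporary 'key' column works on any pandas version)
entries = (entries.assign(key=1)
                  .merge(meta.assign(key=1), on='key')
                  .drop('key', axis=1))
```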

In [9]:
import pandas as pd
In [10]:
df = pd.DataFrame.from_dict(d['feed']['entry'])
In [11]:
df.shape
Out[11]:
(10, 9)
In [12]:
df.columns
Out[12]:
Index(['title', 'link', 'published', 'updated', 'id', 'content', 'author',
       'category', 'summary'],
      dtype='object')
In [13]:
df.head()
Out[13]:
title link published updated id content author category summary
0 {'@type': 'html', '#text': 'Notes On Pyspark'} {'@href': 'https://knanne.github.io/posts/note... 2017-12-27T00:00:00+00:00 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link...
1 {'@type': 'html', '#text': 'Random Resources F... {'@href': 'https://knanne.github.io/posts/rand... 2017-11-22T00:00:00+00:00 2017-11-22T00:00:00+00:00 https://knanne.github.io/posts/random-resource... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} NaN {'@type': 'html', '#text': 'Random notes, link...
2 {'@type': 'html', '#text': 'Preprocessing Data... {'@href': 'https://knanne.github.io/posts/prep... 2017-11-11T00:00:00+00:00 2017-11-11T00:00:00+00:00 https://knanne.github.io/posts/preprocessing-d... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Notes on preparing...
3 {'@type': 'html', '#text': 'Notes On Regular E... {'@href': 'https://knanne.github.io/posts/note... 2017-10-04T00:00:00+00:00 2017-10-04T00:00:00+00:00 https://knanne.github.io/posts/notes-on-regula... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} {'@term': 'regex'} {'@type': 'html', '#text': 'Notes and scripts ...
4 {'@type': 'html', '#text': 'Cleaning Data In P... {'@href': 'https://knanne.github.io/posts/clea... 2017-09-24T00:00:00+00:00 2017-09-24T00:00:00+00:00 https://knanne.github.io/posts/cleaning-data-i... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Helpful scripts fo...

Cleaning

Okay, you may notice how dirty the above DataFrame looks. It is actually quite detailed, which is nice, but we will need to transform some columns before we can analyze them.

Column with Dictionary (single key)

Consider the "author" column. Each value is simply a dictionary with a single key-value pair. We can access the value by applying .get("key") to each element of the column.

In [14]:
df['author'].apply(lambda x: x.get('name'))
Out[14]:
0    Kain Nanne
1    Kain Nanne
2    Kain Nanne
3    Kain Nanne
4    Kain Nanne
5    Kain Nanne
6    Kain Nanne
7    Kain Nanne
8    Kain Nanne
9    Kain Nanne
Name: author, dtype: object

More conveniently, we can convert the column to a pd.Series, which takes care of extracting the value and renaming the column to the key, and then simply join it back to our original DataFrame.

In [15]:
df.author.apply(pd.Series)
Out[15]:
name
0 Kain Nanne
1 Kain Nanne
2 Kain Nanne
3 Kain Nanne
4 Kain Nanne
5 Kain Nanne
6 Kain Nanne
7 Kain Nanne
8 Kain Nanne
9 Kain Nanne
In [16]:
df = df.join(df.author.apply(pd.Series), how='left')

Preview the last column to verify that worked.

In [17]:
df.shape
Out[17]:
(10, 10)
In [18]:
df.head()
Out[18]:
title link published updated id content author category summary name
0 {'@type': 'html', '#text': 'Notes On Pyspark'} {'@href': 'https://knanne.github.io/posts/note... 2017-12-27T00:00:00+00:00 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne
1 {'@type': 'html', '#text': 'Random Resources F... {'@href': 'https://knanne.github.io/posts/rand... 2017-11-22T00:00:00+00:00 2017-11-22T00:00:00+00:00 https://knanne.github.io/posts/random-resource... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} NaN {'@type': 'html', '#text': 'Random notes, link... Kain Nanne
2 {'@type': 'html', '#text': 'Preprocessing Data... {'@href': 'https://knanne.github.io/posts/prep... 2017-11-11T00:00:00+00:00 2017-11-11T00:00:00+00:00 https://knanne.github.io/posts/preprocessing-d... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Notes on preparing... Kain Nanne
3 {'@type': 'html', '#text': 'Notes On Regular E... {'@href': 'https://knanne.github.io/posts/note... 2017-10-04T00:00:00+00:00 2017-10-04T00:00:00+00:00 https://knanne.github.io/posts/notes-on-regula... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} {'@term': 'regex'} {'@type': 'html', '#text': 'Notes and scripts ... Kain Nanne
4 {'@type': 'html', '#text': 'Cleaning Data In P... {'@href': 'https://knanne.github.io/posts/clea... 2017-09-24T00:00:00+00:00 2017-09-24T00:00:00+00:00 https://knanne.github.io/posts/cleaning-data-i... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Helpful scripts fo... Kain Nanne

Column with Dictionary (multiple keys)

Consider the "title" column, which contains a single dictionary with multiple keys. Applying the same method as above works well and explodes both keys into columns with their respective values.

In [19]:
df.title.apply(pd.Series)
Out[19]:
@type #text
0 html Notes On Pyspark
1 html Random Resources For Data Analysis
2 html Preprocessing Data In Pandas
3 html Notes On Regular Expression
4 html Cleaning Data In Pandas
5 html Automating Everything In Python
6 html Notes On Sql
7 html Notable Open Source Tools
8 html Calculations For Data Analysis
9 html How To Visualize A Resume In Tableau
In [20]:
df = df.join(df.title.apply(pd.Series), how='left')
In [21]:
df.shape
Out[21]:
(10, 12)
In [22]:
df.head()
Out[22]:
title link published updated id content author category summary name @type #text
0 {'@type': 'html', '#text': 'Notes On Pyspark'} {'@href': 'https://knanne.github.io/posts/note... 2017-12-27T00:00:00+00:00 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark
1 {'@type': 'html', '#text': 'Random Resources F... {'@href': 'https://knanne.github.io/posts/rand... 2017-11-22T00:00:00+00:00 2017-11-22T00:00:00+00:00 https://knanne.github.io/posts/random-resource... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} NaN {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Random Resources For Data Analysis
2 {'@type': 'html', '#text': 'Preprocessing Data... {'@href': 'https://knanne.github.io/posts/prep... 2017-11-11T00:00:00+00:00 2017-11-11T00:00:00+00:00 https://knanne.github.io/posts/preprocessing-d... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Notes on preparing... Kain Nanne html Preprocessing Data In Pandas
3 {'@type': 'html', '#text': 'Notes On Regular E... {'@href': 'https://knanne.github.io/posts/note... 2017-10-04T00:00:00+00:00 2017-10-04T00:00:00+00:00 https://knanne.github.io/posts/notes-on-regula... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} {'@term': 'regex'} {'@type': 'html', '#text': 'Notes and scripts ... Kain Nanne html Notes On Regular Expression
4 {'@type': 'html', '#text': 'Cleaning Data In P... {'@href': 'https://knanne.github.io/posts/clea... 2017-09-24T00:00:00+00:00 2017-09-24T00:00:00+00:00 https://knanne.github.io/posts/cleaning-data-i... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Helpful scripts fo... Kain Nanne html Cleaning Data In Pandas

Clean Up Columns

Of course, at any time, including when joining, you can drop and rename columns to keep things clear.

In [23]:
df = df.drop(['author','title'], axis=1)\
    .rename(columns={'name':'author', '#text':'title'})

Data Types

Similarly, data types can be converted during the above processing when needed. For example, on dates:

In [24]:
df.loc[:,'published'] = pd.to_datetime(df.published)
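Depending on your pandas version, strings with explicit UTC offsets like those in this feed may parse to naive or to timezone-aware timestamps; passing utc=True to pd.to_datetime forces a consistent timezone-aware UTC result. A small sketch on sample timestamps:

```python
import pandas as pd

# sample ISO-8601 timestamps with explicit UTC offsets
s = pd.Series(['2017-12-27T00:00:00+00:00', '2017-11-22T00:00:00+00:00'])

# utc=True guarantees a timezone-aware UTC datetime column
parsed = pd.to_datetime(s, utc=True)
```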

Column with List of Dictionaries

Now consider the "category" column, which is a list of dictionaries containing all the categories associated with a post. However, if there is only one category the value is a plain dictionary, and if there are none it is NULL (np.nan). So what happens if we simply try to explode this column into a pd.Series?

In [25]:
df.category.apply(pd.Series)
C:\Users\Kain\Anaconda3\lib\site-packages\pandas\core\indexes\api.py:77: RuntimeWarning: '<' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
  result = result.union(other)
C:\Users\Kain\Anaconda3\lib\site-packages\pandas\core\indexes\api.py:43: RuntimeWarning: '<' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
  union = _union_indexes(indexes)
Out[25]:
0 1 2 3 @term
0 {'@term': 'pyspark'} {'@term': 'spark'} {'@term': 'python'} {'@term': 'databricks'} NaN
1 NaN NaN NaN NaN NaN
2 {'@term': 'python'} {'@term': 'pandas'} NaN NaN NaN
3 NaN NaN NaN NaN regex
4 {'@term': 'python'} {'@term': 'pandas'} NaN NaN NaN
5 NaN NaN NaN NaN python
6 {'@term': 'sql'} {'@term': 'mysql'} {'@term': 'postgresql'} {'@term': 'oracle'} NaN
7 NaN NaN NaN NaN software
8 {'@term': 'calculations'} {'@term': 'equations'} NaN NaN NaN
9 {'@term': 'tableau'} {'@term': 'resume'} NaN NaN NaN

The above shows something inconsistent is happening across our mixed data types: the lone dictionaries are exploded by key while the lists are exploded by position. So instead, let's wrap the single dictionaries in [] to convert them to lists and standardize our data.

In [26]:
df.category.apply(lambda x: pd.Series(x) if isinstance(x,list) else pd.Series([x]))
Out[26]:
0 1 2 3
0 {'@term': 'pyspark'} {'@term': 'spark'} {'@term': 'python'} {'@term': 'databricks'}
1 NaN NaN NaN NaN
2 {'@term': 'python'} {'@term': 'pandas'} NaN NaN
3 {'@term': 'regex'} NaN NaN NaN
4 {'@term': 'python'} {'@term': 'pandas'} NaN NaN
5 {'@term': 'python'} NaN NaN NaN
6 {'@term': 'sql'} {'@term': 'mysql'} {'@term': 'postgresql'} {'@term': 'oracle'}
7 {'@term': 'software'} NaN NaN NaN
8 {'@term': 'calculations'} {'@term': 'equations'} NaN NaN
9 {'@term': 'tableau'} {'@term': 'resume'} NaN NaN

Okay, not bad. Now let's stack this data, rename the column, extract the actual value, and add the term index as a second column.

In [27]:
exploded_column = df.category.apply(lambda x: pd.Series(x) if isinstance(x,list) else pd.Series([x]))\
    .stack()\
    .rename('term')\
    .apply(lambda x: x['@term'])\
    .to_frame()\
    .reset_index(level=1)\
    .rename(columns={'level_1':'term_index'})
In [28]:
exploded_column
Out[28]:
term_index term
0 0 pyspark
0 1 spark
0 2 python
0 3 databricks
2 0 python
2 1 pandas
3 0 regex
4 0 python
4 1 pandas
5 0 python
6 0 sql
6 1 mysql
6 2 postgresql
6 3 oracle
7 0 software
8 0 calculations
8 1 equations
9 0 tableau
9 1 resume

Now join this back to our original DataFrame to add these categories to each post. (FYI: this join will of course multiply your rows, so beware with large datasets.)

In [29]:
df = df.join(exploded_column, how='left')
In [30]:
df.shape
Out[30]:
(20, 12)
In [31]:
df.head()
Out[31]:
link published updated id content category summary author @type title term_index term
0 {'@href': 'https://knanne.github.io/posts/note... 2017-12-27 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark 0.0 pyspark
0 {'@href': 'https://knanne.github.io/posts/note... 2017-12-27 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark 1.0 spark
0 {'@href': 'https://knanne.github.io/posts/note... 2017-12-27 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark 2.0 python
0 {'@href': 'https://knanne.github.io/posts/note... 2017-12-27 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark 3.0 databricks
1 {'@href': 'https://knanne.github.io/posts/rand... 2017-11-22 2017-11-22T00:00:00+00:00 https://knanne.github.io/posts/random-resource... {'@type': 'html', '@xml:base': 'https://knanne... NaN {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Random Resources For Data Analysis NaN NaN
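As an aside: if that row multiplication is a concern, an alternative (sketched here on made-up data standing in for the stacked terms) is to collapse the terms into a single list-valued column, keeping one row per post.

```python
import pandas as pd

# made-up stand-in for the stacked terms, indexed by post number
exploded = pd.DataFrame({'term': ['pyspark', 'spark', 'python']},
                        index=[0, 0, 2])

# one row per post, with all terms collected into a list
terms_per_post = exploded.groupby(level=0)['term'].agg(list)
```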

Analyze

Now that the data has been processed into a familiar format, we can perform typical analysis.

In [32]:
df.groupby([df.published.dt.year, df.published.dt.month]).agg({'title':'count'})
Out[32]:
title
published published
2017 3 2
6 7
9 3
10 1
11 3
12 4
In [33]:
df.groupby('title').agg({'term': 'nunique'})
Out[33]:
term
title
Automating Everything In Python 1
Calculations For Data Analysis 2
Cleaning Data In Pandas 2
How To Visualize A Resume In Tableau 2
Notable Open Source Tools 1
Notes On Pyspark 4
Notes On Regular Expression 1
Notes On Sql 4
Preprocessing Data In Pandas 2
Random Resources For Data Analysis 0
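As one more example of the kind of analysis this format enables (the terms below are a made-up stand-in for the exploded term column), counting the most common category terms is a one-liner with value_counts:

```python
import pandas as pd

# made-up stand-in for the exploded term column
terms = pd.Series(['python', 'pandas', 'python', 'sql'], name='term')

# frequency of each category term across all posts, most common first
counts = terms.value_counts()
```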