pandas_process_XML

How To: XML to Pandas

This notebook demonstrates how to process XML data in Python using pandas. Specifically, we use the xmltodict library to parse XML into a dictionary, then read what we want into pandas.

We then show how to clean up the various data types and nested columns into a format that can be used for analysis.

NOTE: The example data here is very basic and contains little numerical data to analyze. However, the methodology should carry over to more quantitative analyses (e.g. statistics over reporting periods).

Get XML

For fun, let's use the RSS feed of this site.

In [1]:
import requests
In [2]:
response = requests.get('https://knanne.github.io/feed.xml')
In [3]:
response.ok
Out[3]:
True

xmltodict

This is a Python library available on PyPI. Install it with pip install xmltodict
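Before parsing the feed, it may help to see xmltodict's conventions on a tiny inline document (the sample XML below is made up for illustration): attributes become keys prefixed with @, text content goes under #text, and repeated tags become lists.

```python
import xmltodict

# made-up sample document for illustration
xml = """
<feed>
  <entry id="1">first</entry>
  <entry id="2">second</entry>
</feed>
"""

d = xmltodict.parse(xml)

# repeated <entry> tags parse to a list of dicts;
# attributes are '@'-prefixed, text is under '#text'
first = d['feed']['entry'][0]
```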

In [4]:
import xmltodict
import json

Parse the XML

In [5]:
%%time
d = xmltodict.parse(response.content)
Wall time: 9.83 ms

Preview the JSON

In [6]:
print(json.dumps(d, indent=2)[:750])
print('.\n'*3)
{
  "feed": {
    "@xmlns": "http://www.w3.org/2005/Atom",
    "generator": {
      "@uri": "https://jekyllrb.com/",
      "@version": "3.6.2",
      "#text": "Jekyll"
    },
    "link": [
      {
        "@href": "https://knanne.github.io/feed.xml",
        "@rel": "self",
        "@type": "application/atom+xml"
      },
      {
        "@href": "https://knanne.github.io/",
        "@rel": "alternate",
        "@type": "text/html"
      }
    ],
    "updated": "2018-01-27T15:24:01+00:00",
    "id": "https://knanne.github.io/",
    "title": {
      "@type": "html",
      "#text": "knanne"
    },
    "subtitle": "Expert traveler, amateur farmer, beer / wine enthusiast - with a data science problem",
    "author": {
      "name": "Kain Nanne"
.
.
.

Convert to Pandas DataFrame

Now, this step still requires some manual work.

Since the JSON is still a series of nested dictionaries, we need to identify the recurring "entries" we are looking for (e.g. documents, posts, entries, etc.).

If you want to extract metadata from the root of your entries, I would suggest creating a separate DataFrame for it and merging your "entries" DataFrame with your "metadata" DataFrame later.
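As a sketch of that suggestion (the feed dictionary below is a minimal made-up example, not this site's actual feed), you could build a one-row metadata DataFrame from the root and merge it onto the entries:

```python
import pandas as pd

# made-up parsed feed: root-level metadata plus repeating entries
d = {'feed': {'title': 'example', 'updated': '2018-01-01',
              'entry': [{'id': 'a', 'published': '2017-12-27'},
                        {'id': 'b', 'published': '2017-11-22'}]}}

# one-row DataFrame of root metadata (everything except the entries)
meta = pd.DataFrame([{k: v for k, v in d['feed'].items() if k != 'entry'}])

# DataFrame of the recurring entries
entries = pd.DataFrame.from_dict(d['feed']['entry'])

# cross join the single metadata row onto every entry
# (the temporary 'key' column works on any pandas version)
entries = (entries.assign(key=1)
                  .merge(meta.assign(key=1), on='key')
                  .drop('key', axis=1))
```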

In [9]:
import pandas as pd
In [10]:
df = pd.DataFrame.from_dict(d['feed']['entry'])
In [11]:
df.shape
Out[11]:
(10, 9)
In [12]:
df.columns
Out[12]:
Index(['title', 'link', 'published', 'updated', 'id', 'content', 'author',
       'category', 'summary'],
      dtype='object')
In [13]:
df.head()
Out[13]:
title link published updated id content author category summary
0 {'@type': 'html', '#text': 'Notes On Pyspark'} {'@href': 'https://knanne.github.io/posts/note... 2017-12-27T00:00:00+00:00 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link...
1 {'@type': 'html', '#text': 'Random Resources F... {'@href': 'https://knanne.github.io/posts/rand... 2017-11-22T00:00:00+00:00 2017-11-22T00:00:00+00:00 https://knanne.github.io/posts/random-resource... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} NaN {'@type': 'html', '#text': 'Random notes, link...
2 {'@type': 'html', '#text': 'Preprocessing Data... {'@href': 'https://knanne.github.io/posts/prep... 2017-11-11T00:00:00+00:00 2017-11-11T00:00:00+00:00 https://knanne.github.io/posts/preprocessing-d... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Notes on preparing...
3 {'@type': 'html', '#text': 'Notes On Regular E... {'@href': 'https://knanne.github.io/posts/note... 2017-10-04T00:00:00+00:00 2017-10-04T00:00:00+00:00 https://knanne.github.io/posts/notes-on-regula... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} {'@term': 'regex'} {'@type': 'html', '#text': 'Notes and scripts ...
4 {'@type': 'html', '#text': 'Cleaning Data In P... {'@href': 'https://knanne.github.io/posts/clea... 2017-09-24T00:00:00+00:00 2017-09-24T00:00:00+00:00 https://knanne.github.io/posts/cleaning-data-i... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Helpful scripts fo...

Cleaning

Okay, you may notice how dirty the above DataFrame looks. It is actually quite detailed, which is nice, but we will need to transform some columns before we can analyze them.

Column with Dictionary (single key)

Consider the "author" column. Each value is simply a dictionary with a single key-value pair. We can access the value by applying .get("key") to each element of the column.

In [14]:
df['author'].apply(lambda x: x.get('name'))
Out[14]:
0    Kain Nanne
1    Kain Nanne
2    Kain Nanne
3    Kain Nanne
4    Kain Nanne
5    Kain Nanne
6    Kain Nanne
7    Kain Nanne
8    Kain Nanne
9    Kain Nanne
Name: author, dtype: object

More conveniently, we can convert the column to a pd.Series, which takes care of extracting the value and renaming the column to the key, and then simply join it back to our original DataFrame.

In [15]:
df.author.apply(pd.Series)
Out[15]:
name
0 Kain Nanne
1 Kain Nanne
2 Kain Nanne
3 Kain Nanne
4 Kain Nanne
5 Kain Nanne
6 Kain Nanne
7 Kain Nanne
8 Kain Nanne
9 Kain Nanne
In [16]:
df = df.join(df.author.apply(pd.Series), how='left')

Preview the last column to verify that worked.

In [17]:
df.shape
Out[17]:
(10, 10)
In [18]:
df.head()
Out[18]:
title link published updated id content author category summary name
0 {'@type': 'html', '#text': 'Notes On Pyspark'} {'@href': 'https://knanne.github.io/posts/note... 2017-12-27T00:00:00+00:00 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne
1 {'@type': 'html', '#text': 'Random Resources F... {'@href': 'https://knanne.github.io/posts/rand... 2017-11-22T00:00:00+00:00 2017-11-22T00:00:00+00:00 https://knanne.github.io/posts/random-resource... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} NaN {'@type': 'html', '#text': 'Random notes, link... Kain Nanne
2 {'@type': 'html', '#text': 'Preprocessing Data... {'@href': 'https://knanne.github.io/posts/prep... 2017-11-11T00:00:00+00:00 2017-11-11T00:00:00+00:00 https://knanne.github.io/posts/preprocessing-d... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Notes on preparing... Kain Nanne
3 {'@type': 'html', '#text': 'Notes On Regular E... {'@href': 'https://knanne.github.io/posts/note... 2017-10-04T00:00:00+00:00 2017-10-04T00:00:00+00:00 https://knanne.github.io/posts/notes-on-regula... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} {'@term': 'regex'} {'@type': 'html', '#text': 'Notes and scripts ... Kain Nanne
4 {'@type': 'html', '#text': 'Cleaning Data In P... {'@href': 'https://knanne.github.io/posts/clea... 2017-09-24T00:00:00+00:00 2017-09-24T00:00:00+00:00 https://knanne.github.io/posts/cleaning-data-i... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Helpful scripts fo... Kain Nanne

Column with Dictionary (multiple keys)

Consider the "title" column, which contains a single dictionary with multiple keys. Applying the same method as above works well and explodes both keys into columns with their respective values.

In [19]:
df.title.apply(pd.Series)
Out[19]:
@type #text
0 html Notes On Pyspark
1 html Random Resources For Data Analysis
2 html Preprocessing Data In Pandas
3 html Notes On Regular Expression
4 html Cleaning Data In Pandas
5 html Automating Everything In Python
6 html Notes On Sql
7 html Notable Open Source Tools
8 html Calculations For Data Analysis
9 html How To Visualize A Resume In Tableau
In [20]:
df = df.join(df.title.apply(pd.Series), how='left')
In [21]:
df.shape
Out[21]:
(10, 12)
In [22]:
df.head()
Out[22]:
title link published updated id content author category summary name @type #text
0 {'@type': 'html', '#text': 'Notes On Pyspark'} {'@href': 'https://knanne.github.io/posts/note... 2017-12-27T00:00:00+00:00 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark
1 {'@type': 'html', '#text': 'Random Resources F... {'@href': 'https://knanne.github.io/posts/rand... 2017-11-22T00:00:00+00:00 2017-11-22T00:00:00+00:00 https://knanne.github.io/posts/random-resource... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} NaN {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Random Resources For Data Analysis
2 {'@type': 'html', '#text': 'Preprocessing Data... {'@href': 'https://knanne.github.io/posts/prep... 2017-11-11T00:00:00+00:00 2017-11-11T00:00:00+00:00 https://knanne.github.io/posts/preprocessing-d... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Notes on preparing... Kain Nanne html Preprocessing Data In Pandas
3 {'@type': 'html', '#text': 'Notes On Regular E... {'@href': 'https://knanne.github.io/posts/note... 2017-10-04T00:00:00+00:00 2017-10-04T00:00:00+00:00 https://knanne.github.io/posts/notes-on-regula... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} {'@term': 'regex'} {'@type': 'html', '#text': 'Notes and scripts ... Kain Nanne html Notes On Regular Expression
4 {'@type': 'html', '#text': 'Cleaning Data In P... {'@href': 'https://knanne.github.io/posts/clea... 2017-09-24T00:00:00+00:00 2017-09-24T00:00:00+00:00 https://knanne.github.io/posts/cleaning-data-i... {'@type': 'html', '@xml:base': 'https://knanne... {'name': 'Kain Nanne'} [{'@term': 'python'}, {'@term': 'pandas'}] {'@type': 'html', '#text': 'Helpful scripts fo... Kain Nanne html Cleaning Data In Pandas

Clean Up Columns

Of course, at any time, including when joining, you can drop and rename columns to keep things clear.

In [23]:
df = df.drop(['author','title'], axis=1)\
    .rename(columns={'name':'author', '#text':'title'})

Data Types

Similarly, data types can be converted during the above processing when needed. For example, on dates:

In [24]:
df.loc[:,'published'] = pd.to_datetime(df.published)
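Depending on your pandas version, strings with explicit UTC offsets like those in this feed may parse to naive or to timezone-aware timestamps; passing utc=True to pd.to_datetime forces a consistent timezone-aware UTC result. A small sketch on sample timestamps:

```python
import pandas as pd

# sample ISO-8601 timestamps with explicit UTC offsets
s = pd.Series(['2017-12-27T00:00:00+00:00', '2017-11-22T00:00:00+00:00'])

# utc=True guarantees a timezone-aware UTC datetime column
parsed = pd.to_datetime(s, utc=True)
```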

Column with List of Dictionaries

Now consider the "category" column, which is a list of dictionaries containing all the categories associated with a post. However, if there is only one category the value is a plain dictionary, and if there are none it is NULL (np.nan). So what happens if we simply try to explode this column into a pd.Series?

In [25]:
df.category.apply(pd.Series)
C:\Users\Kain\Anaconda3\lib\site-packages\pandas\core\indexes\api.py:77: RuntimeWarning: '<' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
  result = result.union(other)
C:\Users\Kain\Anaconda3\lib\site-packages\pandas\core\indexes\api.py:43: RuntimeWarning: '<' not supported between instances of 'str' and 'int', sort order is undefined for incomparable objects
  union = _union_indexes(indexes)
Out[25]:
0 1 2 3 @term
0 {'@term': 'pyspark'} {'@term': 'spark'} {'@term': 'python'} {'@term': 'databricks'} NaN
1 NaN NaN NaN NaN NaN
2 {'@term': 'python'} {'@term': 'pandas'} NaN NaN NaN
3 NaN NaN NaN NaN regex
4 {'@term': 'python'} {'@term': 'pandas'} NaN NaN NaN
5 NaN NaN NaN NaN python
6 {'@term': 'sql'} {'@term': 'mysql'} {'@term': 'postgresql'} {'@term': 'oracle'} NaN
7 NaN NaN NaN NaN software
8 {'@term': 'calculations'} {'@term': 'equations'} NaN NaN NaN
9 {'@term': 'tableau'} {'@term': 'resume'} NaN NaN NaN

The above shows something inconsistent is happening across our mixed data types: the lone dictionaries are exploded by key while the lists are exploded by position. So instead, let's wrap the single dictionaries in [] to convert them to lists and standardize our data.

In [26]:
df.category.apply(lambda x: pd.Series(x) if isinstance(x,list) else pd.Series([x]))
Out[26]:
0 1 2 3
0 {'@term': 'pyspark'} {'@term': 'spark'} {'@term': 'python'} {'@term': 'databricks'}
1 NaN NaN NaN NaN
2 {'@term': 'python'} {'@term': 'pandas'} NaN NaN
3 {'@term': 'regex'} NaN NaN NaN
4 {'@term': 'python'} {'@term': 'pandas'} NaN NaN
5 {'@term': 'python'} NaN NaN NaN
6 {'@term': 'sql'} {'@term': 'mysql'} {'@term': 'postgresql'} {'@term': 'oracle'}
7 {'@term': 'software'} NaN NaN NaN
8 {'@term': 'calculations'} {'@term': 'equations'} NaN NaN
9 {'@term': 'tableau'} {'@term': 'resume'} NaN NaN

Okay, not bad. Now let's stack this data, rename the column, extract the actual value, and add the term index as a second column.

In [27]:
exploded_column = df.category.apply(lambda x: pd.Series(x) if isinstance(x,list) else pd.Series([x]))\
    .stack()\
    .rename('term')\
    .apply(lambda x: x['@term'])\
    .to_frame()\
    .reset_index(level=1)\
    .rename(columns={'level_1':'term_index'})
In [28]:
exploded_column
Out[28]:
term_index term
0 0 pyspark
0 1 spark
0 2 python
0 3 databricks
2 0 python
2 1 pandas
3 0 regex
4 0 python
4 1 pandas
5 0 python
6 0 sql
6 1 mysql
6 2 postgresql
6 3 oracle
7 0 software
8 0 calculations
8 1 equations
9 0 tableau
9 1 resume

Now join this back to our original DataFrame to add these categories to each post. (FYI: this join will of course multiply your rows, so beware with large datasets.)

In [29]:
df = df.join(exploded_column, how='left')
In [30]:
df.shape
Out[30]:
(20, 12)
In [31]:
df.head()
Out[31]:
link published updated id content category summary author @type title term_index term
0 {'@href': 'https://knanne.github.io/posts/note... 2017-12-27 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark 0.0 pyspark
0 {'@href': 'https://knanne.github.io/posts/note... 2017-12-27 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark 1.0 spark
0 {'@href': 'https://knanne.github.io/posts/note... 2017-12-27 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark 2.0 python
0 {'@href': 'https://knanne.github.io/posts/note... 2017-12-27 2017-12-27T00:00:00+00:00 https://knanne.github.io/posts/notes-on-pyspark {'@type': 'html', '@xml:base': 'https://knanne... [{'@term': 'pyspark'}, {'@term': 'spark'}, {'@... {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Notes On Pyspark 3.0 databricks
1 {'@href': 'https://knanne.github.io/posts/rand... 2017-11-22 2017-11-22T00:00:00+00:00 https://knanne.github.io/posts/random-resource... {'@type': 'html', '@xml:base': 'https://knanne... NaN {'@type': 'html', '#text': 'Random notes, link... Kain Nanne html Random Resources For Data Analysis NaN NaN
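As an aside: if that row multiplication is a concern, an alternative (sketched here on made-up data standing in for the stacked terms) is to collapse the terms into a single list-valued column, keeping one row per post.

```python
import pandas as pd

# made-up stand-in for the stacked terms, indexed by post number
exploded = pd.DataFrame({'term': ['pyspark', 'spark', 'python']},
                        index=[0, 0, 2])

# one row per post, with all terms collected into a list
terms_per_post = exploded.groupby(level=0)['term'].agg(list)
```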

Analyze

Now that the data has been processed into a familiar format, we can perform typical analysis.

In [32]:
df.groupby([df.published.dt.year, df.published.dt.month]).agg({'title':'count'})
Out[32]:
title
published published
2017 3 2
6 7
9 3
10 1
11 3
12 4
In [33]:
df.groupby('title').agg({'term': 'nunique'})
Out[33]:
term
title
Automating Everything In Python 1
Calculations For Data Analysis 2
Cleaning Data In Pandas 2
How To Visualize A Resume In Tableau 2
Notable Open Source Tools 1
Notes On Pyspark 4
Notes On Regular Expression 1
Notes On Sql 4
Preprocessing Data In Pandas 2
Random Resources For Data Analysis 0
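As one more example of the kind of analysis this format enables (the terms below are a made-up stand-in for the exploded term column), counting the most common category terms is a one-liner with value_counts:

```python
import pandas as pd

# made-up stand-in for the exploded term column
terms = pd.Series(['python', 'pandas', 'python', 'sql'], name='term')

# frequency of each category term across all posts, most common first
counts = terms.value_counts()
```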