PDF Data Mining

Sample PDF

Let's take a publicly accessible PDF as a sample, and for fun let's use my Master's thesis.

In [1]:
import urllib.request
import shutil
In [2]:
url = 'https://raw.githubusercontent.com/knanne/vu_msc_tweetsumm/master/research/KainNanne_MSc_Thesis_ACM.pdf'
In [3]:
with urllib.request.urlopen(url) as response, open('sample.pdf', 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
In [4]:
file = 'sample.pdf'

PDF to Text using pdfminer

Below is a function to convert the file to text. Source credit: https://stackoverflow.com/a/26495057/5356898

In [5]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(file):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(file, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    # collapse runs of extra spaces left behind by the layout analysis
    text = text.replace('  ', ' ').replace('  ', ' ')

    fp.close()
    device.close()
    retstr.close()
    return text
In [6]:
pdf = convert_pdf_to_txt(file)

Now that we have the text of the PDF document as a single string, you may want to apply some fancy regular expressions to split and parse the text as you wish.

In [7]:
pdf[:2500]
Out[7]:
'Automating Open Domain Event Summaries by\n\nHarnessing Collective Reactions on Twitter\n\nKain Nanne\n\n∗\n\nVU University Amsterdam\nk.nanne@student.vu.nl\n\nABSTRACT\nMicroblogging sites have become popular platforms for on-\nline news reporting as well as socially participating in and\ninteracting with the discussion of real-time events. This pa-\nper researches an automated solution to the inability of a\nhuman to wholly consume and comprehend the vast amount\nof data surrounding topics online. We introduce the Collec-\ntive Reactions for Event Summarization (CRES) approach,\nwhich uses an original combination of proven algorithms to\nharness signals in online activity, social interactions, con-\ntent metadata, and language overlap to build comprehen-\nsive summaries of events through collective reactions from\nthe crowd. The methodology is open sourced as an end-to-\nend framework exemplified using twelve open domain events.\nOur experiments consider the two questions of: create a\nstandard feature set for consistently classifying newsworthi-\nness in open domain microblog documents, and provide a\nsummary which improves upon the defined baseline when\nevaluated using CrowdTruth. Results show promising re-\nsults towards consistent classification on open domain doc-\numents, and significant improvements to our baseline for\nautomating event summarization on Twitter.\n\nKeywords\ntwitter, automatic summarization, event detection, text min-\ning, document classification\n\n∗MSc. Business Information Systems\n\n1.\n\nINTRODUCTION\n\nMicroblogging is the activity of sharing a small amount of\ninformation over the web. These small documents of infor-\nmation can include combinations of text and media content,\nand are typically shared over the public domain. Microblog-\nging has become increasingly popular for social as well as\nnews reporting. 
A 2015 survey by the Pew Research Cen-\nter found that over 60% of social media users of Facebook\nand Twitter actively source news from the sites, an increase\nin over 10% for both sites from 2013. [3] Particularly in\nnews, given the short time it takes to write and ability to\nshare with a mass audience instantaneously, it is attractive\nfor organizations to share smaller pieces of information as\nreal-time events unfold. As a result of this, an immense\namount of information surrounding an event is fragmented\nacross sites and accounts making it impossible for a human\nto wholly comprehend. As this information is disconnected\nand drowning in extraneous data, it is a difficult'
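As a hedged sketch of such a regex pass, the snippet below splits out the numbered section headings, which pdfminer renders on their own lines as e.g. `1.\n\nINTRODUCTION`. The sample string here is a made-up miniature of the output above, not the real extracted text:

```python
import re

# Hypothetical sample mimicking the extracted text above, where numbered
# section headings come out as "1.\n\nINTRODUCTION" on their own lines
text = ('ABSTRACT\nMicroblogging sites have become popular...\n\n'
        '1.\n\nINTRODUCTION\n\nMicroblogging is the activity of...\n\n'
        '2.\n\nRELATED WORK\n\nEarlier approaches...')

# Split at each numbered heading; the capture group keeps the heading
# names in the result so they can be paired with their section bodies
parts = re.split(r'\d+\.\n\n([A-Z][A-Z ]+)\n\n', text)
sections = dict(zip(parts[1::2], parts[2::2]))

list(sections)  # ['INTRODUCTION', 'RELATED WORK']
```

The exact pattern will depend on how your document's headings survive the conversion, so inspect the raw string first and adjust accordingly.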

PDF Metadata using pdfminer

You will notice below that the metadata in this particular PDF is virtually nonexistent. However, the code serves as a demonstration of how one would extract such data when it is present.

In [8]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
In [9]:
fp = open(file, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
In [10]:
{k: v if isinstance(v, bytes) else resolve1(v) for k, v in doc.info[0].items()}
Out[10]:
{'Author': b'',
 'CreationDate': b"D:20160929165748Z00'00'",
 'Creator': b'LaTeX with hyperref package',
 'Keywords': b'',
 'ModDate': b"D:20160929165748Z00'00'",
 'Producer': b'dvips + GPL Ghostscript 9.05',
 'Subject': b'',
 'Title': b''}
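Note that the date fields use PDF's own `D:YYYYMMDDHHmmSS` format rather than ISO 8601. A minimal sketch of a parser for them follows; `parse_pdf_date` is my own helper, not part of pdfminer, and it ignores the timezone suffix (`Z00'00'`, `+02'00'`, etc.) for simplicity:

```python
import re
from datetime import datetime

def parse_pdf_date(raw):
    """Parse a PDF 'D:YYYYMMDDHHmmSS' date string (bytes or str).

    The timezone suffix, if any, is ignored for simplicity.
    """
    if isinstance(raw, bytes):
        raw = raw.decode('ascii', errors='ignore')
    m = re.match(r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})", raw)
    if not m:
        return None
    return datetime(*map(int, m.groups()))

parse_pdf_date(b"D:20160929165748Z00'00'")
# datetime(2016, 9, 29, 16, 57, 48)
```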

Resolve Metadata to XML (if it exists)

Depending on the system in which your PDF was created, for example if it was electronically signed in something like DocuSign, you may find information on the signers here, including emails, names, and dates of form completion.

In [11]:
catalog_metadata = doc.catalog['Metadata']
resolved_xml = catalog_metadata.resolve()
In [12]:
from bs4 import BeautifulSoup

try:
    soup = BeautifulSoup(resolved_xml.get_data(), 'lxml')
except Exception:  # e.g. PDFNotImplementedError for unsupported stream filters
    soup = BeautifulSoup(resolved_xml.rawdata, 'lxml')
In [13]:
print(soup.prettify(formatter=None))
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?adobe-xap-filters esc="CRLF"?>
<html>
 <body>
  <x:xmpmeta x:xmptk="XMP toolkit 2.9.1-13, framework 1.6" xmlns:x="adobe:ns:meta/">
   <rdf:rdf xmlns:ix="http://ns.adobe.com/iX/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:description rdf:about="uuid:6631a5c2-be82-11f1-0000-c7fc2c5fefc0" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
     <pdf:producer>
      dvips + GPL Ghostscript 9.05
     </pdf:producer>
     <pdf:keywords>
      ()
     </pdf:keywords>
    </rdf:description>
    <rdf:description rdf:about="uuid:6631a5c2-be82-11f1-0000-c7fc2c5fefc0" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
     <xmp:modifydate>
      2016-09-29T16:57:48Z
     </xmp:modifydate>
     <xmp:createdate>
      2016-09-29T16:57:48Z
     </xmp:createdate>
     <xmp:creatortool>
      LaTeX with hyperref package
     </xmp:creatortool>
    </rdf:description>
    <rdf:description rdf:about="uuid:6631a5c2-be82-11f1-0000-c7fc2c5fefc0" xapmm:documentid="uuid:6631a5c2-be82-11f1-0000-c7fc2c5fefc0" xmlns:xapmm="http://ns.adobe.com/xap/1.0/mm/">
    </rdf:description>
    <rdf:description dc:format="application/pdf" rdf:about="uuid:6631a5c2-be82-11f1-0000-c7fc2c5fefc0" xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>
      <rdf:alt>
       <rdf:li xml:lang="x-default">
        ()
       </rdf:li>
      </rdf:alt>
     </dc:title>
     <dc:creator>
      <rdf:seq>
       <rdf:li>
        ()
       </rdf:li>
      </rdf:seq>
     </dc:creator>
     <dc:description>
      <rdf:seq>
       <rdf:li>
        ()
       </rdf:li>
      </rdf:seq>
     </dc:description>
    </rdf:description>
   </rdf:rdf>
  </x:xmpmeta>
  <?xpacket end='w'?>
 </body>
</html>

You may now want to extract certain data by tag and process it as you like.

In [14]:
d = soup.find('xmp:createdate')
In [15]:
import pandas as pd

pd.to_datetime(d.text).strftime('%Y-%m-%d')
Out[15]:
'2016-09-29'
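If you would rather not import pandas for a single timestamp, the standard library can handle these XMP dates too, since they are ISO 8601; swapping the trailing `Z` for an explicit offset keeps this working on Python versions before 3.11:

```python
from datetime import datetime

# XMP timestamps are ISO 8601; replace the trailing 'Z' with an explicit
# UTC offset so datetime.fromisoformat accepts it on older Pythons
raw = '2016-09-29T16:57:48Z'
dt = datetime.fromisoformat(raw.replace('Z', '+00:00'))

dt.strftime('%Y-%m-%d')  # '2016-09-29'
```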