
Testing PDF Extraction Tools

estrus2_o72siy

We are now 8 weeks into the Official Inquiries project. One of the most time-consuming challenges the project faces is that of turning PDF files of inquiries into clean, readable text. Converting a PDF file into a text file is not a simple matter of changing a file extension or copying and pasting from one to another.

To accomplish this conversion, we use free, open source tools which can read a PDF and rewrite it as a text file. Of course, none of these tools can interpret every file perfectly, let alone format the output exactly as we would like, and each inquiry’s PDF is different. So we have tested and compared several of them to see which works best for our purposes. After all, the more work the tools can do, the less tidying we have to do ourselves before we can present the text online!

As an example, let’s take a page of the US Senate’s PSI report into the Financial Crisis:

Our objective is to turn this PDF into a text file, which can then be tidied and put online. When we began the Official Inquiries project, our default tool for this was PDFMiner. Here is what PDFMiner makes of that page of text:

A. Subcommittee Investigation

Important cause of the crisis, it provides new, detailed, and compelling evidence of what happened. In so doing, we hope the Report leads to solutions that prevent it from happening
again.

1 EXECUTIVE SUMMARY

Investigation into some of the key causes of the financial crisis. Since then, the Subcommittee has engaged in a wide-ranging inquiry, issuing subpoenas, conducting over 150 interviews and depositions, and consulting with dozens of government, academic, and private sector experts.
The Subcommittee has accumulated and reviewed tens of millions of pages of documents, including court pleadings, filings with the Securities and Exchange Commission, trustee reports, prospectuses for public and private offerings, corporate board and committee minutes, mortgage transactions and analyses, memoranda, marketing materials, correspondence, and emails. The Subcommittee has also reviewed documents prepared by or sent to or from banking and In November 2008, the Permanent Subcommittee on Investigations initiated its ^LB. Overview.

(1) High Risk Lending:

Case Study of Washington Mutual Bank

Securities regulators, including bank examination reports, reviews of securities firms, enforcement actions, analyses, memoranda, correspondence, and emails. In April 2010, the Subcommittee held four hearings examining four root causes of the financial crisis. Using case studies detailed in thousands of pages of documents released at the
hearings, the Subcommittee presented and examined evidence showing how high risk lending by U.S. financial institutions; regulatory failures; inflated credit ratings; and high risk, poor quality financial products designed and sold by some investment banks, contributed to the financial
crisis. This Report expands on those hearings and the case studies they featured. The case studies are Washington Mutual Bank, the largest bank failure in U.S. history; the federal Office of Thrift Supervision which oversaw Washington Mutual’s demise; Moody’s and Standard &
Poor’s, the country’s two largest credit rating agencies; and Goldman Sachs and Deutsche Bank, two leaders in the design, marketing, and sale of mortgage related securities. This Report devotes a chapter to how each of the four causative factors, as illustrated by the case studies, fueled the 2008 financial crisis, providing findings of fact, analysis of the issues, and
recommendations for next steps.

2

The first chapter focuses on how high risk mortgage lending contributed to the financial crisis, using as a case study Washington Mutual Bank (WaMu). At the time of its failure, WaMu was the nation’s largest thrift and sixth largest bank, with $300 billion in assets, $188 billion in deposits, 2,300 branches in 15 states, and over 43,000 employees. Beginning in 2004, it
embarked upon a lending strategy to pursue higher profits by emphasizing high risk loans. By 2006, WaMu’s high risk loans began incurring high rates of delinquency and default, and in 2007, its mortgage backed securities began incurring ratings downgrades and losses. Also in 2007, the bank itself began incurring losses due to a portfolio that contained poor quality and
fraudulent loans and securities. Its stock price dropped as shareholders lost confidence, and depositors began withdrawing funds, eventually causing a liquidity crisis at the bank. On September 25, 2008, WaMu was seized by its regulator, the Office of Thrift Supervision, placed in receivership with the Federal Deposit Insurance Corporation (FDIC), and sold to JPMorgan
Chase for $1.9 billion. Had the sale not gone through, WaMu’s failure might have exhausted the entire $45 billion Deposit Insurance Fund.

As you can see, there are quite a few problems: there is a lot of extra spacing in the text, artefacts like “^L” (a form feed character, which marks a page break) have been added and, most noticeably, the text is jumbled and no longer in the order it appears in the PDF. The PSI report into the Financial Crisis is hundreds of pages long, so tidying this up manually would be very time-consuming. With that in mind, we decided to test Apache PDFBox to see if it could do a little better. Here’s what it output:

  1. EXECUTIVE SUMMARY

Subcommittee Investigation

In November 2008, the Permanent Subcommittee on Investigations initiated its investigation into some of the key causes of the financial crisis. Since then, the Subcommittee has engaged in a wide-ranging inquiry, issuing subpoenas, conducting over 150 interviews and depositions, and consulting with dozens of government, academic, and private sector experts.
The Subcommittee has accumulated and reviewed tens of millions of pages of documents, including court pleadings, filings with the Securities and Exchange Commission, trustee reports, prospectuses for public and private offerings, corporate board and committee minutes, mortgage
transactions and analyses, memoranda, marketing materials, correspondence, and emails. The Subcommittee has also reviewed documents prepared by or sent to or from banking and

Securities regulators, including bank examination reports, reviews of securities firms, enforcement actions, analyses, memoranda, correspondence, and emails.
In April 2010, the Subcommittee held four hearings examining four root causes of the financial crisis. Using case studies detailed in thousands of pages of documents released at the hearings, the Subcommittee presented and examined evidence showing how high risk lending by U.S. financial institutions; regulatory failures; inflated credit ratings; and high risk, poor quality financial products designed and sold by some investment banks, contributed to the financial crisis. This Report expands on those hearings and the case studies they featured. The case studies are Washington Mutual Bank, the largest bank failure in U.S. history; the federal Office of Thrift Supervision which oversaw Washington Mutual’s demise; Moody’s and Standard & Poor’s, the country’s two largest credit rating agencies; and Goldman Sachs and Deutsche Bank, two leaders in the design, marketing, and sale of mortgage related securities. This Report devotes a chapter to how each of the four causative factors, as illustrated by the case studies,
fueled the 2008 financial crisis, providing findings of fact, analysis of the issues, and recommendations for next steps.

  1. Overview

(1) High Risk Lending:

Case Study of Washington Mutual Bank

The first chapter focuses on how high risk mortgage lending contributed to the financial crisis, using as a case study Washington Mutual Bank (WaMu). At the time of its failure, WaMu was the nation’s largest thrift and sixth largest bank, with $300 billion in assets, $188 billion in deposits, 2,300 branches in 15 states, and over 43,000 employees. Beginning in 2004, it embarked upon a lending strategy to pursue higher profits by emphasizing high risk loans. By 2006, WaMu’s high risk loans began incurring high rates of delinquency and default, and in 2007, its mortgage backed securities began incurring ratings downgrades and losses. Also in 2007, the bank itself began incurring losses due to a portfolio that contained poor quality and fraudulent loans and securities. Its stock price dropped as shareholders lost confidence, and depositors began withdrawing funds, eventually causing a liquidity crisis at the bank. On September 25, 2008, WaMu was seized by its regulator, the Office of Thrift Supervision, placed in receivership with the Federal Deposit Insurance Corporation (FDIC), and sold to JPMorgan Chase for $1.9 billion. Had the sale not gone through, WaMu’s failure might have exhausted the entire $45 billion Deposit Insurance Fund.

This result is much better: the text is in the right order and the stray artefacts are gone. We also tested another tool called Poppler, which delivered a similar result to PDFBox, but with some of the extra artefacts we saw from PDFMiner. Given that PDFBox had given us such good results with the Financial Crisis report, we also tested it on pages from the Chilcot Report and the Leveson Report. Although the difference was not as dramatic, PDFBox was the most consistent of the three and, going forward, it will be the tool we use most to convert inquiry PDFs into text.
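A side-by-side reading like the one above can be backed up with a rough score. The following is a minimal sketch, not necessarily the procedure used for the tests described here: it assumes a hand-corrected reference transcription of the page is available, and compares a tool's output to it word by word using Python's standard difflib module. The function name order_similarity and the sample strings are illustrative only.

```python
import difflib


def order_similarity(extracted: str, reference: str) -> float:
    """Score from 0 to 1: how closely the extracted words follow the reference.

    str.split() with no arguments splits on any whitespace, so line wrapping,
    blank lines and form feeds (shown as ^L) do not affect the score.
    """
    extracted_words = extracted.split()
    reference_words = reference.split()
    matcher = difflib.SequenceMatcher(
        None, extracted_words, reference_words, autojunk=False
    )
    return matcher.ratio()


# Illustrative strings only; in practice these would be a tool's output
# and a hand-corrected transcription of the same page.
reference = (
    "In November 2008, the Permanent Subcommittee on Investigations "
    "initiated its investigation"
)
# Same words in the same order, but with extra line breaks and a form feed.
in_order = (
    "In November 2008, the Permanent\nSubcommittee on Investigations\n\n"
    "initiated its investigation\f"
)
# Same words, but with the last word moved to the front.
jumbled = (
    "investigation In November 2008, the Permanent Subcommittee on "
    "Investigations initiated its"
)

print(order_similarity(in_order, reference))
print(order_similarity(jumbled, reference))
```

A higher score means fewer words are missing or out of place. It says nothing about whether the layout is pleasant to read, so it complements reading the output rather than replacing it.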

However, while all of these tools are very useful for processing modern reports like Chilcot and Leveson, which were released as well-made PDF files for public consumption, some reports, particularly older ones, are only available as PDFs made up of simple scans of the paper originals. Computers find it much harder to read text within image files, and none of the three tools we tested is capable of reading these scanned PDFs. If you try to extract from a scanned PDF with one of these tools, it will just give you a list of image file names!
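When working through many PDFs, it helps to flag the scanned ones automatically so they can be set aside. Below is a rough heuristic sketch, under our own assumptions: it treats a page as scanned if, once lines that are nothing but an image file name are ignored, the extracted text contains very few words. The list of file extensions, the threshold of 20 words and the function name looks_scanned are all illustrative choices, not part of any of the tools.

```python
import re

# A line consisting of a single token ending in a common image extension.
# The extension list is an assumption; adjust it to what your tool emits.
IMAGE_NAME = re.compile(r"\S+\.(?:png|jpe?g|tiff?|bmp|gif)", re.IGNORECASE)


def looks_scanned(extracted_text: str, min_words: int = 20) -> bool:
    """Guess whether extracted text came from a scanned (image-only) page."""
    lines = [line.strip() for line in extracted_text.splitlines()]
    # Ignore blank lines and lines that are just an image file name.
    prose = [
        line for line in lines
        if line and not IMAGE_NAME.fullmatch(line)
    ]
    # Count runs of two or more letters as words.
    words = re.findall(r"[A-Za-z]{2,}", " ".join(prose))
    return len(words) < min_words


scanned_sample = "page-001.png\npage-002.png\n"
text_sample = (
    "The Subcommittee has accumulated and reviewed tens of millions of "
    "pages of documents, including court pleadings, filings with the "
    "Securities and Exchange Commission, trustee reports."
)

print(looks_scanned(scanned_sample))
print(looks_scanned(text_sample))
```

Pages that really are nearly empty, such as title pages, would also be flagged, so anything this check picks out still needs a quick look by eye.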

While the main PDF for the PSI Financial Crisis report is a modern PDF, the extended report, in four volumes, is scanned. So our next challenge is to work out how we can get text from these volumes. As a preliminary step, we tested Google Cloud Platform’s Vision API to see what it made of a few pages of one of the volumes.

As you can see, we have our work cut out for us!
If you have any suggestions or would like to help our project, please visit our website or our GitHub repo.
