The pdf2txt project combines interfaces to a number of PDF to text converters with text preprocessors that refine the converted text for use in further NLP applications.
This project has been published to Maven Central and can be used by
sbt and other build tools as a library dependency. Include a line like this in
build.sbt to incorporate the main project along with all the subprojects:
libraryDependencies += "org.clulab" %% "pdf2txt" % "1.1.2"
Pdf2txtApp can be run directly from the pre-built
jar file. The only prerequisite is Java. Startup is significantly quicker than when it runs via
The PDF converters are divided into two categories. Some converters work locally, with no network connection needed, while others depend on remote servers to perform the conversion. The default is the local tika converter:
This converter is a combination of Ghostscript for conversion of PDF to images and Tesseract for conversion of images to text. It depends on both of these programs having been installed in advance and being available on the $PATH if default settings are used. The settings can be adjusted. See the subproject's README.md for details. This converter does not do well on any but the simplest pages, but it is able to process images embedded in PDFs.
This Python project is further wrapped in Python code included as a resource with this project. It gets run as an external process using the
python3 command which must be available on the
pdfminer needs to have been installed in advance, possibly with
pip install pdfminer.
This executable program needs to be installed on the local computer and accessible via the operating system
$PATH so that the
pdftotext command can run. It is started as an external process to perform the conversion. See the README.md file for configuration details.
Science Parse is a Scala library that parses scientific papers. The pre-built jars are included in this project because recent versions are no longer available in standard repositories (e.g., Maven Central). This converter relies on large machine learning models which are downloaded when the converter is first used.
If your text has already been converted from PDF and only needs to be preprocessed, then this is the "converter" to use. It is implemented directly in this project rather than in a subproject. In contrast to the others, it reads files matching *.txt rather than *.pdf.
Apache Tika provides a Java library which is included as a dependency for this project. This is the default converter.
This converter provides an interface to Adobe's online PDF Extract service. The service requires credentials and eventual payment if used beyond the trial limits. See the adobe subproject's README.md for configuration details. The service returns a zip file containing a description of the PDF. The zip files are saved alongside the PDFs and will be reused if the same PDF is converted again. Converted text is generated wholly from the zip file and if one is found with the PDF, the call to the service is skipped (and the credentials are not used or needed).
Amazon provides via AWS a similar online Textract service. The service requires credentials and eventual payment if used beyond the trial limits. An S3 bucket may also be required. See the amazon subproject's README.md for configuration details. The service converts the PDF document into images and performs optical character recognition (OCR) to recover the text. It knows about pages, lines, and words, but not about paragraphs or other logical document structure. Input files of more than one page need to temporarily reside in an S3 bucket. If no bucket is configured (the value is an empty string), none is used, but that will cause errors if a PDF has more than one page.
Google's Cloud Vision API also offers PDF to text conversion. The service requires credentials and eventual payment if used beyond the trial limits. A cloud storage bucket is required. See the google subproject's README.md for configuration details. The service separates the PDF into pages and performs optical character recognition (OCR) on each one separately. Both input and output files need to temporarily reside in a storage bucket. The service returns a json file containing a description of the PDF. The json files are saved alongside the PDFs and will be reused if the same PDF is converted again. Converted text is generated wholly from the json file and if one is found with the PDF, the call to the service is skipped (and the credentials are not used or needed).
Microsoft has its own computer vision service. As with several of the other converters, its processing of PDFs is an extension of more general image processing capabilities. The PDF is converted into an image and then scanned for text. In this case, no cloud storage is needed and no temporary files are created. Credentials are required and there are eventual charges after a trial period. Free conversions are limited both in number of pages (to two) and submission rate (20 calls per minute). There are also image size limits. See Microsoft documentation for input requirements. The subproject's README.md has information on how to configure the credentials.
Preprocessors can be configured on (true) and off (false) as shown later, but they are by default applied in the order given here. That can be changed if the project is used as a library, since it is an (ordered) array of preprocessors that gets passed around. Because actions of one preprocessor can affect how the next might work or the previous might have worked, the list is traversed multiple times until the output no longer changes.
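The repeated traversal can be pictured as a fixed-point loop. The following is only a minimal sketch, not the project's implementation: the `Preprocessor` type alias, the method names, and the `maxPasses` safety limit are all assumptions made for illustration.

```scala
object PreprocessorLoopSketch {
  // Hypothetical stand-in for the project's preprocessor type.
  type Preprocessor = String => String

  // Apply every preprocessor once, in the order given.
  def applyOnce(preprocessors: Seq[Preprocessor], text: String): String =
    preprocessors.foldLeft(text)((current, preprocessor) => preprocessor(current))

  // Repeat full passes until a pass leaves the text unchanged.
  // maxPasses is an assumed safety limit, not a documented setting.
  @annotation.tailrec
  def applyUntilStable(preprocessors: Seq[Preprocessor], text: String, maxPasses: Int = 10): String = {
    val next = applyOnce(preprocessors, text)
    if (next == text || maxPasses <= 1) next
    else applyUntilStable(preprocessors, next, maxPasses - 1)
  }
}
```

Because the preprocessors arrive as an ordered sequence, a library user could reorder or filter that sequence before passing it in.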
This preprocessor removes blank lines that some PDF converters leave between populated lines of text even though there is no paragraph break and usually not even the end of a sentence intervening. After the blank line is removed, text parsers can usually piece together a sentence that is split across the remaining lines.
Blank lines are otherwise assumed to end paragraphs. Sentences cannot span paragraphs, so at the end of each paragraph a period is added if necessary. This prevents parsers from combining things like multiple section headings into a single nonsensical sentence.
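As a rough illustration of the idea, a paragraph terminator might look like the sketch below. The object and method names are hypothetical, and the set of characters counted as sentence-final punctuation is an assumption; the project's own rules may differ.

```scala
object ParagraphSketch {
  // Assumed set of characters that already end a sentence.
  private val sentenceFinal = Set('.', '!', '?')

  // Split on blank lines, then add a period to any paragraph
  // that does not already end in sentence-final punctuation.
  def terminateParagraphs(text: String): String = {
    val paragraphs = text.split("\n\\s*\n").map(_.trim).filter(_.nonEmpty)
    val terminated = paragraphs.map { paragraph =>
      if (sentenceFinal(paragraph.last)) paragraph else paragraph + "."
    }
    terminated.mkString("\n\n")
  }
}
```

With this, consecutive headings such as "Introduction" and "Related Work" each become their own short sentence instead of merging.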
Conversion of unicode characters is controlled by a translation table which can remove accents, spell out Greek letters, convert to spaces, etc. and a list of accented characters which might be spared from such conversion. How these are used is controlled by parameters. In the command line interface, they are hard coded, but the library provides access.
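The interplay between the translation table and the list of spared characters could be sketched as follows. The table entries, the names, and the `keepSpared` parameter are invented for illustration and do not reflect the project's actual table or parameters.

```scala
object UnicodeSketch {
  // Hypothetical translation table: e-acute, Greek alpha, non-breaking space.
  val translations: Map[Char, String] = Map(
    '\u00e9' -> "e",
    '\u03b1' -> "alpha",
    '\u00a0' -> " "
  )

  // Hypothetical list of accented characters that may be spared.
  val spared: Set[Char] = Set('\u00e9')

  // Translate each character unless it is spared and sparing is enabled.
  def translate(text: String, keepSpared: Boolean): String =
    text.flatMap { char =>
      if (keepSpared && spared(char)) char.toString
      else translations.getOrElse(char, char.toString)
    }
}
```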
Headers and titles are often indicated with words that have been capitalized. Unfortunately, this can confuse part of speech taggers and named entity recognizers. Case is restored here so that words appear as they would in normal sentences for more accurate processing.
Numbers are sometimes converted so that spaces separate some of the digits or a comma lands after a space as in 123 ,45. This preprocessor tries to remove unnecessary spaces within numbers.
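One narrow case, a space that lands before a comma or period inside a number, can be handled with a regular expression. This sketch covers only that case and is an assumption about how such a rule might be written; the names are hypothetical and the project's preprocessor is likely more thorough.

```scala
object NumberSketch {
  // Spaces preceded by a digit and followed by a separator and another digit.
  private val spaceBeforeSeparator = """(?<=\d) +(?=[,.]\d)""".r

  // Drop the offending spaces and leave everything else untouched.
  def tighten(text: String): String = spaceBeforeSeparator.replaceAllIn(text, "")
}
```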
Many PDF converters have difficulties with ligatures, like ﬃ typeset as single glyphs, resulting in spaces inserted into words. Such situations are detected and resolved with this preprocessor. "coe ffi cient" would be corrected to "coefficient". In order to do so, it must have a fairly good idea of what is a word or not and even whether one word is more probable than another. Therefore, this preprocessor (and all the remaining ones) makes use of a language model described in the next section.
Words, particularly in justified text, are often hyphenated and split between lines of text. Some words already include hyphens that are not optional. This preprocessor, with aid from a language model, attempts to find words split across lines and unite the parts.
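A simplified version of this decision is sketched below. The `isWord` function stands in for the language model and is purely hypothetical, as are the other names; the sketch joins the parts when the combined form is a known word and otherwise keeps the hyphen.

```scala
import scala.util.matching.Regex

object LineBreakSketch {
  // Letters, a hyphen, a line break, then more letters.
  private val splitWord = """(\p{L}+)-\n(\p{L}+)""".r

  // isWord is an assumed stand-in for the language model.
  def unite(text: String, isWord: String => Boolean): String =
    splitWord.replaceAllIn(text, m => {
      val joined = m.group(1) + m.group(2)
      // Drop the hyphen only if the joined form is a known word.
      val result = if (isWord(joined)) joined else m.group(1) + "-" + m.group(2)
      Regex.quoteReplacement(result)
    })
}
```

For example, with a vocabulary containing only "example", the text "exam-", line break, "ple" becomes "example", while "well-", line break, "known" becomes "well-known".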
Given the many kinds of dashes (-, –, —, etc.) within words, PDF converters sometimes can't tell whether the letters after belong to the same word or the next one and unwanted spaces can get inserted. Words with hyphens are recombined here. For example, "left- handed" might be restored to "left-handed" or "two- year- old" to "two-year-old".
Finally, sometimes spaces just appear magically within words. They might be removed here, but by default the Never language model is configured out of an abundance of caution. Library users can change this.
The preprocessor unit tests include illustrative examples of transformations.
The primary responsibility of the language models is to determine whether word "parts" should be joined so that a word is whole again. The parts may have resulted from spaces or hyphens having been inserted between characters of a word. The programming interface looks like this:
def shouldJoin(left: String, right: String, prevWords: Seq[String]): Boolean
It decides whether a sentence starting "Wordone wordtwo left right" is OK or should have been "Wordone wordtwo leftright". This might be calculated based on something like
P(Wordone wordtwo leftright | Wordone wordtwo) > P(Wordone wordtwo left | Wordone wordtwo)
P(leftright) > P(left)
The language models below are currently available. Both the
glove use not only vocabulary from their respective dictionaries, but dynamically add to it words from the document they are currently processing. A novel word such as a product or brand name that is seen without a hyphen in a document can be used to de-hyphenate other instances in the document.
Always join left and right, which is useful in testing.
Use word frequencies derived from gigaword. Since counts are involved, this is coded as a
Use words, without frequencies, derived from glove. Since these are without counts, this is called a
Never join left and right, which is again useful in testing.
A HuggingFace language model is also anticipated.
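To make the interface concrete, here is a minimal sketch of how such models might be written against the shouldJoin signature shown earlier. The trait and class names and the addDocumentWords method are hypothetical and are not the project's actual classes. The vocabulary-based model ignores prevWords and simply checks whether the joined word is known.

```scala
import scala.collection.mutable

object LanguageModelSketch {
  // The signature matches the interface given in the text.
  trait LanguageModel {
    def shouldJoin(left: String, right: String, prevWords: Seq[String]): Boolean
  }

  // Always join, which is useful in testing.
  object AlwaysModel extends LanguageModel {
    def shouldJoin(left: String, right: String, prevWords: Seq[String]): Boolean = true
  }

  // Never join, the cautious choice.
  object NeverModel extends LanguageModel {
    def shouldJoin(left: String, right: String, prevWords: Seq[String]): Boolean = false
  }

  // Hypothetical vocabulary-only model without frequencies.
  class VocabularyModel(dictionary: Set[String]) extends LanguageModel {
    private val vocabulary = mutable.Set.empty[String]
    vocabulary ++= dictionary.map(_.toLowerCase)

    // Words seen in the current document extend the vocabulary.
    def addDocumentWords(words: Iterable[String]): Unit =
      vocabulary ++= words.map(_.toLowerCase)

    // Join when the combined word is known; prevWords is unused here.
    def shouldJoin(left: String, right: String, prevWords: Seq[String]): Boolean =
      vocabulary.contains((left + right).toLowerCase)
  }
}
```

A frequency-based model would compare counts or probabilities rather than testing set membership, along the lines of the inequalities above.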
Command Line Syntax
Although this project is intended more as a library, there are several command line applications included. Many read all the PDF files in an input directory, convert them to text, preprocess them for potential use with other NLP projects, and then write them to an output directory. They differ mainly in which component converts the PDF to text. Pdf2txtApp should be noted in particular, since it is the most encompassing. Here are highlights from its help text.
From the command line with sbt and having the git repo, use
sbt "run <arguments>"
or from the command line after having run "sbt assembly" and changed directories (target/scala-2.12) or after having downloaded the jar file,
java -jar pdf2txt.jar <arguments>
converts all PDFs in the current directory to text files.
-in ./pdfs -out ./txts
converts all PDFs in
./pdfs to text files in
-converter pdftotext -wordBreakBySpace false -in doc.pdf -out doc.txt
pdftotxt without the
-converter text -in file.txt -out file.out.txt
preprocesses file.txt resulting in file.out.txt
To get the full help text, use
This software uses lots of memory for multiple large neural network models and dictionaries. It may not run on machines with less than 16GB of memory, particularly with ScienceParse, and even then, settings may need to be adjusted so that the memory available can also be used. If you encounter errors indicating memory exhaustion, such as
[error] ## Exception when compiling 44 sources to /clulab/pdf2txt-project/pdf2txt/target/scala-2.11/classes
[error] java.lang.OutOfMemoryError: Java heap space
Exception in thread "ModelLoaderThread" java.lang.OutOfMemoryError: Java heap space
then here are some tips to try:
sbt can't complete commands like
assembly for lack of memory, then the
-Xmx setting in .jvmopts might be increased. The Windows version of
sbt seems to ignore this file, so it may be necessary to instead set the value of the environment variable
_JAVA_OPTIONS. Depending on the shell, that might be with
sbt can't complete the
test command, then the value for
ThisBuild / Test / javaOptions in test.sbt needs to be adjusted.
run command doesn't work, then use the setting for
run / javaOptions in build.sbt.
If you execute the jar file from Java and run out of memory, then the environment variable
_JAVA_OPTIONS is the best place to make the change. The command for Windows is above. For other operating systems, it is usually
java -jar is problematic, then lowering the value for the
-threads argument can reduce memory requirements because fewer documents will be processed at the same time.
In each case adjust the number before the
g (gigabytes) as needed.
Please note that the startup messages from fatdynet that are printed to
stderr like the ones below are normal and not indicative of a problem.
[error] [dynet] Checking /home/user/pwd for libdynet_swig.so...
[error] [dynet] Checking /home/user for libdynet_swig.so...
[error] [dynet] Extracting resource libdynet_swig.so to /tmp/libdynet_swig-8897097308525612384.so...
[error] [dynet] Loading DyNet from /tmp/libdynet_swig-8897097308525612384.so...
[error] [dynet] random seed: 2522620396
[error] [dynet] allocating memory: 512MB
[error] [dynet] memory allocation done.