Read PDF Reports and Extract Data

TimE · April 28, 2020, 9:41pm

Over the years we have created hundreds of PDF reports. We now want to extract the data from these reports and put it into a database. I know there are external programs that can do this, but if possible I would like to integrate this into my project. Is this wishful thinking?

Thanks,
Tim

Niken_Panchal · October 21, 2022, 9:20am

I am also looking for something similar: reading the values from a table column in a PDF file and writing them to a tag. Please point me to a script or method for doing this.

pturmel · October 21, 2022, 12:27pm

I don't have an answer for you, and I suspect there's not an easy one. Tim didn't get an answer in the past 2½ years. A web search for "java pdf parsers" might get you started, but expect to implement a custom parser for every possible page layout you have in your PDFs.

Or abandon this idea, and go to the data source that produced your PDFs.

Niken_Panchal · October 24, 2022, 7:34am

Thanks for getting back to me. I am planning to use a third-party parser such as the PyPDF2 library to extract the data. I hope it works, and I will post again once I succeed.

Transistor · October 24, 2022, 12:01pm

Be prepared for trouble. From my experience creating PDFs with PHP libraries, the file data order does not have to match the layout order. In other words, the PDF file could start by specifying the footer and header, followed by a table, followed by the heading above it, etc. It all depends on how the rendering engine was programmed. Don't expect to be able to read from top to bottom and left to right. If the layout is fairly simple you may be able to get it to work.
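To illustrate the ordering problem, here is a minimal sketch of rebuilding reading order from positioned text. It assumes a parser has already produced text fragments as (x, y, text) tuples. That tuple shape, the function name fragments_to_lines, and the y_tolerance parameter are all hypothetical and not part of any particular library. It relies only on the fact that PDF coordinates start at the bottom-left of the page.

```python
def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into visual lines, top to bottom."""
    # PDF user space has its origin at the bottom-left corner,
    # so a larger y means higher up on the page.
    ordered = sorted(fragments, key=lambda f: -f[1])
    lines = []
    for x, y, text in ordered:
        # Fragments whose y is close to the current line's y share that line.
        if lines and abs(lines[-1]["y"] - y) <= y_tolerance:
            lines[-1]["items"].append((x, text))
        else:
            lines.append({"y": y, "items": [(x, text)]})
    # Within each line, read left to right by x.
    return [" ".join(t for _, t in sorted(line["items"])) for line in lines]


# Hypothetical fragments in file order: footer first, then header, then the table.
fragments = [
    (72, 40, "Page 1"),
    (72, 760, "Monthly Report"),
    (72, 650, "Tank A"), (300, 650, "41.5"),
    (300, 700, "Level"), (72, 700, "Tank"),
]
lines = fragments_to_lines(fragments)
# lines == ["Monthly Report", "Tank Level", "Tank A 41.5", "Page 1"]
```

This only handles a simple single-column layout. Multi-column pages, rotated text, or tables with wrapped cells would need layout-specific rules, which is why a custom parser per report layout is likely.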

Niken_Panchal · November 7, 2022, 11:00am

I installed PyPDF2 into Ignition's pylib folder, which was fairly easy to do. When I imported PyPDF2 into a script, the first error was "Non-ASCII character in source file", which I fixed using the PEP header line. But I couldn't find a solution for the second error, 'Expecting RPAREN ':' Syntax Error', which probably means a few lines of the library's source code need to be rewritten. I have abandoned the PyPDF2 library and am now working on using the Apache PDFBox .jar file instead. I'm looking for a way to install .jar files into Ignition.
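For anyone hitting the same first error: the "PEP header line" is presumably the source encoding declaration from PEP 263 (that reading is an assumption, since the post does not name the PEP). A minimal sketch of what it looks like follows. The declaration has to be on the first or second line of the source file, and DEGREE_LABEL is just a hypothetical name used for illustration.

```python
# -*- coding: utf-8 -*-
# The line above is a PEP 263 encoding declaration. Python 2.7 style
# interpreters need it before they will accept non-ASCII characters
# in a source file; Python 3 already defaults to UTF-8.

# Hypothetical constant containing a non-ASCII character (the degree sign).
DEGREE_LABEL = u"Temperature (°C)"
```

The second error is a different kind of problem. A syntax error complaining about ':' where a closing parenthesis was expected is typical of Python 3 only syntax, such as type annotations in function signatures, being read by a Python 2.7 style parser. If that is the cause here, an encoding header will not help.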

PGriffith · November 15, 2022, 4:39pm

If you just want to load an arbitrary .jar file, you can use either approach here: