PDF Parser scripting

I want to extract data from a PDF file. I was able to get the data I wanted using PyPDF2 in a Python 3.8 environment. Could I install this package in Ignition and use the same code in Ignition scripting? Or is there a better approach to get what I want?

This is the code I used, for reference.

import PyPDF2 as pdflib
import re

pdf = pdflib.PdfReader('data.PDF')
txt = u'\n'.join(pg.extract_text() for pg in pdf.pages)

pattern = r'Dnia:.*?\n'
data = re.findall(pattern, txt)

# Regular expression to find dates and numbers
pattern_datesnum = re.compile(r'\d{2}\.\d{2}\.\d{4}|\+?\d[\d\.]*')

result = []
for line in data:
    a = line.split()[1].replace('.', '/')
    b = [string.replace('.', '') for string in line.split()[2:]]
    c = a + " " + " ".join(b)
    result.append(c)

new_list = []
for element in result:
    # Remove numbers preceded by "Page:"
    element_without_page = re.sub(r'Page:\s*\d+/\d+\s*', '', element)
    # Keep only numbers and "-"
    numbers_and_hyphens = re.findall(r'[-/\d]+', element_without_page)
    # Join the numbers and hyphens found
    new_list.append(' '.join(numbers_and_hyphens))

According to the PyPDF documentation, this has C dependencies, so the prospect of using it in Jython seems problematic.

It does use C extensions for some algorithms to improve performance.

Ok thanks, I'll try another package then.

How did you get this working in Ignition? I am about to do the same thing and am looking for pointers.

Since it requires C dependencies, he didn't. You could set up a Python Flask server that runs your C-dependent Python scripts, call it from Ignition over HTTP (GET/POST), and send the results back to Ignition via the Web Dev module (or by putting the results somewhere Ignition knows to look, such as a database or a file in a known directory).
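For anyone trying the Flask route, here is a minimal sketch of what the external service could look like. It runs under regular CPython (not inside Ignition) and reuses the parsing logic from the first post. The `/parse` route, the `parse_rows`, `extract_text` and `create_app` names, and the choice to send the raw PDF bytes as the POST body are all assumptions for illustration. It also assumes Flask and PyPDF2 are installed in that CPython environment.

```python
import io
import re


def parse_rows(txt):
    """Turn each 'Dnia:' line into 'dd/mm/yyyy n1 n2 ...' (same logic as the original script)."""
    rows = []
    for line in re.findall(r'Dnia:.*?\n', txt):
        parts = line.split()
        if len(parts) < 2:
            continue  # no date after 'Dnia:', skip the line
        date = parts[1].replace('.', '/')
        numbers = [p.replace('.', '') for p in parts[2:]]
        joined = ' '.join([date] + numbers)
        # Drop "Page: x/y" markers, then keep only digits, '-' and '/'
        joined = re.sub(r'Page:\s*\d+/\d+\s*', '', joined)
        rows.append(' '.join(re.findall(r'[-/\d]+', joined)))
    return rows


def extract_text(pdf_bytes):
    # Imported lazily so parse_rows can be used without PyPDF2 installed
    import PyPDF2
    reader = PyPDF2.PdfReader(io.BytesIO(pdf_bytes))
    return u'\n'.join(page.extract_text() for page in reader.pages)


def create_app():
    # Imported lazily; Flask is only needed when the service is actually started
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route('/parse', methods=['POST'])  # hypothetical route name
    def parse():
        # Assumes the caller sends the raw PDF bytes as the request body
        text = extract_text(request.get_data())
        return jsonify({'rows': parse_rows(text)})

    return app

# To start the service: create_app().run(host='127.0.0.1', port=5000)
```

From Ignition you would then POST the file contents to that URL (for example with `system.net.httpPost` or `system.net.httpClient`) and decode the JSON response. Treat this as a starting point, not a hardened service.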

You could also shell out to a script via ProcessBuilder and call a batch/shell script that runs your python script with arguments.
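A rough sketch of the ProcessBuilder approach, written for Ignition's Jython scripting. The function names and the idea of passing the PDF path as a single argument are assumptions for illustration; the external script is whatever CPython script you already have that prints its results to stdout.

```python
def build_command(python_exe, script_path, pdf_path):
    """Return the argument list for the external CPython process."""
    return [python_exe, script_path, pdf_path]


def run_external_parser(python_exe, script_path, pdf_path):
    """Run the external script and return (exit_code, list_of_output_lines)."""
    # Java imports are kept inside the function: they only resolve under Jython (Ignition)
    from java.lang import ProcessBuilder
    from java.io import BufferedReader, InputStreamReader

    pb = ProcessBuilder(build_command(python_exe, script_path, pdf_path))
    pb.redirectErrorStream(True)  # merge stderr into stdout so one reader sees everything
    proc = pb.start()

    reader = BufferedReader(InputStreamReader(proc.getInputStream()))
    lines = []
    line = reader.readLine()
    while line is not None:  # readLine() returns null (None) at end of stream
        lines.append(line)
        line = reader.readLine()

    exit_code = proc.waitFor()  # blocks until the external process finishes
    return exit_code, lines
```

You would call it with the paths on your own gateway or client machine, e.g. `run_external_parser("python", "parse_pdf.py", "data.PDF")` (hypothetical paths), and check that the exit code is 0 before trusting the output. Since `waitFor()` blocks, it is probably best not to call this from a script that has to return quickly.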

I just started looking at ProcessBuilder after seeing it in another thread, thanks!
