PDF Parser scripting

mmadariaga · June 11, 2024, 9:59am

I want to extract data from a PDF file. I was able to get the data I wanted using PyPDF2 in a Python 3.8 environment. Could I install this package in Ignition and use the same code in Ignition scripting? Is there a better approach to get what I want?

This is the code I used, for reference.

import PyPDF2 as pdflib
import re

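# Read the PDF and join the extracted text of every page
pdf = pdflib.PdfReader('data.PDF')
txt = u'\n'.join(pg.extract_text() for pg in pdf.pages)

# Grab everything from each "Dnia:" label to the end of its line
pattern = r'Dnia:.*?\n'
data = re.findall(pattern, txt)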

# Regular expression to find dates and numbers (not used further down)
pattern_datesnum = re.compile(r'\d{2}\.\d{2}\.\d{4}|\+?\d[\d\.]*')

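result = []
for line in data:
    # Field 0 is the "Dnia:" label, field 1 is the date (dd.mm.yyyy -> dd/mm/yyyy)
    a = line.split()[1].replace('.', '/')
    # Remaining fields: strip the dots from each value
    b = [string.replace('.', '') for string in line.split()[2:]]
    c = a + " " + " ".join(b)
    result.append(c)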

new_list = []
for element in result:
    # Remove the "Page: n/m" marker
    element_without_page = re.sub(r'Page:\s*\d+/\d+\s*', '', element)
    # Keep only digits, "-" and "/"
    numbers_and_hyphens = re.findall(r'[-/\d]+', element_without_page)
    # Join the numbers and hyphens found
    new_list.append(' '.join(numbers_and_hyphens))
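
Only the first two statements above need PyPDF2; everything after that works on plain text with the standard `re` module, which Jython 2.7 also ships. As a rough sketch, the text-processing part could be pulled into one function that takes the already extracted text, so the PDF-reading step can be swapped for another library. The function name `parse_dnia_lines` and the sample text are made up for illustration; the real PDF text may look different.

```python
import re

# Same patterns as in the script above
DNIA_LINE = re.compile(r'Dnia:.*?\n')
PAGE_MARKER = re.compile(r'Page:\s*\d+/\d+\s*')
DIGITS_HYPHENS_SLASHES = re.compile(r'[-/\d]+')


def parse_dnia_lines(txt):
    """Return one cleaned string per 'Dnia:' line found in txt.

    Hypothetical helper: it mirrors the two loops above but takes the
    text as an argument instead of reading a PDF itself.
    """
    rows = []
    for line in DNIA_LINE.findall(txt):
        fields = line.split()
        if len(fields) < 2:
            # "Dnia:" with nothing after it: no date to extract
            continue
        date = fields[1].replace('.', '/')
        values = [field.replace('.', '') for field in fields[2:]]
        joined = ' '.join([date] + values)
        # Drop the "Page: n/m" marker, then keep only digits, "-" and "/"
        joined = PAGE_MARKER.sub('', joined)
        rows.append(' '.join(DIGITS_HYPHENS_SLASHES.findall(joined)))
    return rows


# Made-up sample text, only to show the call
sample = (
    "Dnia: 11.06.2024 1.234 -56 Page: 1/2\n"
    "some other line\n"
    "Dnia: 12.06.2024 7.890\n"
)
rows = parse_dnia_lines(sample)
```

With this split, whatever ends up extracting the text in Ignition only has to hand a single string to `parse_dnia_lines`.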

justinedwards.jle · June 11, 2024, 10:06am

According to the PyPDF documentation, this has C dependencies, so the prospect of using it in Jython seems problematic.

"it does use C extensions for some algorithms to improve performance."

mmadariaga · June 11, 2024, 10:08am

Ok thanks, I'll try another package then.