RTTSoftware Support Forum

PDF-ShellTools => Ideas/Suggestions => Topic started by: edgaughan@hotmail.com on February 24, 2024, 11:58:19 PM

Title: optimize PDFs for machine learning / AI model training
Post by: edgaughan@hotmail.com on February 24, 2024, 11:58:19 PM
My company trains and grounds Large Language Models (LLMs) with PDF files. The problem is that the valuable part of a
PDF is the body text, while the table of contents, footnotes, index, and headers/footers create problems (especially with semantic search).

Do any of your utilities allow for batch processing of files that will:
- delete all text below a point size (i.e. deleting text <= 9 points will remove footnotes and the index)
- remove Table of Contents
- remove all text in margins 
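For the first bullet, here is a minimal sketch in plain Python of the filtering step, assuming the text has already been extracted as (text, font-size) spans by some extraction tool. The span format and the filter_spans function are illustrative assumptions, not PDF-ShellTools' actual API:

```python
# Sketch only: once text spans are extracted with their font sizes,
# dropping everything at or below a point-size threshold is simple.
# The (text, font_size) tuple format is an assumption for illustration.

def filter_spans(spans, min_size=9.0):
    """Keep only spans whose font size is strictly above min_size.

    spans: iterable of (text, font_size) tuples.
    """
    return [text for text, size in spans if size > min_size]

spans = [
    ("Chapter 1: Introduction", 14.0),       # heading -> keep
    ("Body text of the chapter.", 11.0),     # body -> keep
    ("1. A footnote in small print.", 8.0),  # footnote -> drop
    ("Index entry, p. 42", 9.0),             # index -> drop (<= 9 pt)
]
print(filter_spans(spans))
```

The hard part, of course, is the extraction itself, not this filter.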

There is a lot of demand for a user-friendly tool that preps PDFs for machine learning.

Title: Re: optimize PDFs for machine learning / AI model training
Post by: RTT on February 26, 2024, 02:48:50 AM
There is functionality to extract text, with the possibility of getting font information (name, size, ...), but not to edit it.

Note that it's not easy to segment a PDF in order to isolate the parts you want to remove. Internally, in the worst-case scenarios, you may have a "goto xy" and a "print" command for each character, in no particular order. There is no indication of what constitutes a word, a paragraph, etc. You need functionality like that used in OCR tools, which are able to provide that type of feature extraction in a useful format such as hOCR.
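To illustrate the worst case described above, here is a small Python sketch (the glyph tuples are invented for illustration, not real PDF-ShellTools output): each glyph arrives as an independently positioned character, and even just recovering reading order means sorting by coordinates, before any notion of word or paragraph exists.

```python
# Sketch: a worst-case PDF content stream places each glyph with its
# own "goto xy" + "print", in no particular order. Reconstructing
# reading order requires sorting by position; word/paragraph grouping
# would need further heuristics (as OCR-style tools provide via hOCR).

def reading_order(glyphs):
    """glyphs: iterable of (x, y, char), with y growing downward.

    Sort top-to-bottom, then left-to-right, and join the characters.
    """
    return "".join(c for _, _, c in sorted(glyphs, key=lambda g: (g[1], g[0])))

# Glyphs emitted in arbitrary order by the producer application:
glyphs = [(30, 10, "l"), (10, 10, "H"), (50, 10, "o"),
          (20, 10, "e"), (40, 10, "l"), (10, 20, "!")]
print(reading_order(glyphs))  # -> Hello!
```

Real pages add complications this ignores: multiple columns, rotated text, and glyphs whose nominal order differs from their visual order.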