PDF-ShellTools > Ideas/Suggestions

optimize PDFs for machine learning / AI model training


edgaughan@hotmail.com:
My company trains and grounds Large Language Models (LLMs) with PDF files. The problem is that the valuable part of a
PDF is the body text, while the table of contents, footnotes, index, and headers/footers create problems (especially with semantic search).

Do any of your utilities allow for batch processing of files that will:
- delete all text below a point size (i.e. deleting text <= 9 points will remove footnotes and the index)
- remove the table of contents
- remove all text in the margins (see the sketch below)
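This is not a PDF-ShellTools feature, but as a rough illustration of the kind of batch filtering asked for above, here is a minimal Python sketch using the third-party PyMuPDF library. The file names, the 9 pt threshold, and the margin width are placeholder assumptions.

```python
# Sketch: redact small text and margin text with PyMuPDF (pip install pymupdf).
import fitz  # PyMuPDF

MIN_SIZE = 9.0   # drop text at or below this point size (footnotes, index)
MARGIN = 50.0    # points from each page edge treated as header/footer/margin

doc = fitz.open("input.pdf")
for page in doc:
    # Shrink the page rectangle to the "body" area; anything outside is margin.
    body = page.rect + (MARGIN, MARGIN, -MARGIN, -MARGIN)
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):       # image blocks have no "lines"
            for span in line["spans"]:
                rect = fitz.Rect(span["bbox"])
                if span["size"] <= MIN_SIZE or not body.contains(rect):
                    page.add_redact_annot(rect)    # mark the span for removal
    page.apply_redactions()                        # physically delete marked text
doc.save("cleaned.pdf")
```

Redaction annotations are used here because applying them physically removes the underlying text from the content stream rather than just hiding it, so the cleaned file can be fed directly to an extraction pipeline.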

There is a lot of demand for a user-friendly tool that preps PDFs for machine learning.

RTT:
There is functionality to extract text, including font information (name, size, ...), but not to edit it.

Take note that it's not easy to segment a PDF in order to isolate the parts you want to remove. Internally, in the worst-case scenarios, you may have a "goto xy" and a "print" command for each character, in no particular order. There is no indication of what is a word, a paragraph, etc. You need functionality like that used in OCR tools, which can provide that kind of feature extraction in a useful format such as hOCR.
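As an illustration of what that OCR-style feature extraction can look like (again, not something PDF-ShellTools provides): a small sketch that rasterizes a page with PyMuPDF and asks Tesseract for hOCR output, which tags paragraphs, lines, and words with bounding boxes. It assumes pymupdf, pytesseract, Pillow, and the Tesseract binary are installed; the file name and DPI are arbitrary.

```python
# Sketch: get hOCR (layout-aware) output for the first page of a PDF.
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("input.pdf")
page = doc[0]
pix = page.get_pixmap(dpi=300)                       # rasterize the page
img = Image.open(io.BytesIO(pix.tobytes("png")))

# hOCR is an HTML document with ocr_par / ocr_line / ocrx_word elements,
# each carrying a bounding box: the segmentation the raw PDF stream lacks.
hocr = pytesseract.image_to_pdf_or_hocr(img, extension="hocr")
with open("page1.hocr", "wb") as f:
    f.write(hocr)
```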
