Author Topic: optimize PDFs for machine learning / AI model training (Read 44982 times)

edgaughan@hotmail.com · « **on:** February 24, 2024, 11:58:19 PM »

My company trains and grounds Large Language Models (LLMs) with PDF files. The problem is that the valuable part of a
PDF is the body text, while the table of contents, footnotes, index, and headers/footers create problems (especially with semantic search).

Do any of your utilities allow for batch processing of files that will:
- delete all text at or below a given point size (e.g., deleting text <= 9 points would remove footnotes and the index)
- remove the table of contents
- remove all text in the margins

There is a lot of demand for a user-friendly tool that preps PDFs for machine learning.
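
To make the request concrete, here is a minimal sketch of the size and margin filters from the list above. It assumes some extractor has already produced text spans with a font size and a bounding box; the `Span` record, the `keep_body_spans` function, and the 54 pt margin are all hypothetical and are not taken from any existing utility. Table-of-contents removal is not covered, since it needs layout analysis rather than a simple threshold.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical record for one run of extracted text."""
    text: str
    size: float  # font size in points
    x0: float    # bounding box in points, origin assumed at top-left
    y0: float
    x1: float
    y1: float


def keep_body_spans(spans, page_width, page_height, min_size=9.0, margin=54.0):
    """Drop spans at or below min_size and spans reaching into the margins."""
    body = []
    for s in spans:
        if s.size <= min_size:
            continue  # likely footnote or index text
        inside = (
            s.x0 >= margin
            and s.x1 <= page_width - margin
            and s.y0 >= margin
            and s.y1 <= page_height - margin
        )
        if not inside:
            continue  # header, footer, or side-margin text
        body.append(s)
    return body


# Example on a US Letter page (612 x 792 points).
spans = [
    Span("Running header", 10.0, 72, 20, 540, 30),
    Span("Body paragraph.", 11.0, 72, 100, 540, 112),
    Span("1. A footnote.", 8.0, 72, 700, 540, 708),
]
body = keep_body_spans(spans, page_width=612, page_height=792)
print([s.text for s in body])
```

The thresholds would need tuning per document set, because a fixed point size or margin will misfire on files that use small body text or unusual page layouts.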

RTT · « **Reply #1 on:** February 26, 2024, 02:48:50 AM »

There are functions to extract text, with the option to get font information (name, size, ...), but none to edit it.

Note that it's not easy to segment a PDF in order to isolate the parts you want to remove. Internally, in the worst case, you may have a "goto xy" and a "print" command for each character, in no particular order. Nothing indicates what is a word, a paragraph, etc. You need functionality like that used in OCR tools, which can provide this kind of feature extraction in a useful format such as hOCR.
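
As a rough illustration of the worst case described above, the sketch below rebuilds text lines from individually positioned characters that arrive in arbitrary order. The `(x, y, char)` tuples, the `group_glyphs_into_lines` name, and the 2 pt tolerance are assumptions made for the example, with y assumed to grow downward. It does not detect words, columns, or paragraphs, which is where OCR-style layout analysis comes in.

```python
def group_glyphs_into_lines(glyphs, y_tolerance=2.0):
    """Group (x, y, char) tuples into text lines, top to bottom.

    Glyphs whose y differs from the first glyph of the current line by at
    most y_tolerance are treated as belonging to that line.
    """
    lines = []  # each entry: [line_y, [(x, char), ...]]
    for x, y, ch in sorted(glyphs, key=lambda g: g[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, ch))
        else:
            lines.append([y, [(x, ch)]])
    # Within each line, order the characters left to right.
    return ["".join(ch for _, ch in sorted(items)) for _, items in lines]


# Characters deliberately given out of reading order.
glyphs = [(20, 10, "i"), (10, 30, "o"), (10, 10, "H"), (20, 30.5, "k")]
print(group_glyphs_into_lines(glyphs))
```

No spaces are inserted between words here; doing that would require comparing the horizontal gaps between neighbouring characters against the glyph widths.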

RTTSoftware Support Forum
