Topic: optimize PDFs for machine learning / AI model training

  • Newbie
  • *
  • Posts: 1
« on: February 24, 2024, 11:58:19 PM »
My company trains and grounds large language models (LLMs) with PDF files. The problem is that the valuable part of a PDF is the body text, while the table of contents, footnotes, index, and headers/footers cause problems (especially with semantic search).

Do any of your utilities support batch processing of files to:
- delete all text at or below a given point size (e.g. deleting text of 9 points or smaller would remove footnotes and the index)
- remove the table of contents
- remove all text in the margins

There is a lot of demand for a user-friendly tool that preps PDFs for machine learning.


  • Administrator
  • *****
  • Posts: 909
Re: optimize PDFs for machine learning / AI model training
« Reply #1 on: February 26, 2024, 02:48:50 AM »
There are functions to extract text, optionally with font information (name, size, ...), but not to edit it.
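
As a rough illustration of what could be done once text has been extracted together with its font size and position, here is a minimal Python sketch that filters the extracted runs rather than editing the PDF. The `Span` record, the `keep_body_spans` helper, the 9-point threshold, and the 54-point margin are all hypothetical and are not part of any of our utilities; the sketch also assumes each run already comes with a bounding box in points.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical record for one extracted run of text (units: points)."""
    text: str
    font_size: float
    x0: float  # left edge of the bounding box
    y0: float  # top edge of the bounding box
    x1: float  # right edge of the bounding box
    y1: float  # bottom edge of the bounding box


def keep_body_spans(spans: List[Span], page_width: float, page_height: float,
                    min_size: float = 9.0, margin: float = 54.0) -> List[Span]:
    """Keep spans larger than min_size that lie fully inside the margins."""
    kept = []
    for s in spans:
        # Drop small text (footnotes, index entries) at or below the threshold.
        if s.font_size <= min_size:
            continue
        # Drop anything that touches the margin band (headers, footers, side notes).
        inside = (s.x0 >= margin and s.x1 <= page_width - margin
                  and s.y0 >= margin and s.y1 <= page_height - margin)
        if inside:
            kept.append(s)
    return kept
```

A fixed size threshold and a fixed margin are crude: body text in some documents is itself 9 points or smaller, and a table of contents usually has body-sized text, so it would not be caught by either rule.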

Note that it is not easy to segment a PDF in order to isolate the parts you want to remove. Internally, in the worst case, a PDF may contain a "goto xy" and a "print" command for each character, in no particular order. There is no indication of what constitutes a word, a paragraph, etc. You need functionality like that used in OCR tools, which can provide this kind of feature extraction in a useful format such as hOCR.
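
To give an idea of the reconstruction work involved, here is a minimal Python sketch that rebuilds lines and word breaks from individually positioned characters given in arbitrary order. The `Glyph` record, the `glyphs_to_lines` helper, and both tolerances are hypothetical, and the sketch assumes a coordinate system where y grows downward, a single column, and unrotated text. Real layout analysis, as done by OCR tools, has to handle far more cases.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical record for one positioned character (units: points)."""
    char: str
    x: float      # left edge
    y: float      # baseline, assumed to grow downward on the page
    width: float  # advance width of the character


def glyphs_to_lines(glyphs: List[Glyph], line_tol: float = 2.0,
                    gap_ratio: float = 0.3) -> List[str]:
    """Rebuild text lines from unordered glyphs using position alone."""
    # Step 1: cluster glyphs into lines by baseline, top to bottom.
    lines = []  # each entry: (baseline of the line's first glyph, glyph list)
    for g in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(g.y - lines[-1][0]) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    # Step 2: order each line left to right and infer spaces from gaps.
    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A gap wider than a fraction of the previous glyph is a word break.
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        result.append(text)
    return result
```

Even this simple case needs two tuned thresholds, and it breaks as soon as a page has multiple columns, tables, or superscripts, which is why dedicated layout analysis is the more realistic route.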