Topic: Scan Documents and automatically write Metadata

daniel.moench

  • Newbie
  • *
  • Posts: 1
Scan Documents and automatically write Metadata
« on: March 29, 2016, 11:49:07 AM »
Hello, in order to improve my workflow I'm looking for a function that lets me tag my incoming and outgoing documents, or write their metadata, automatically.
An invoice, which is either scanned or already available as a PDF, always has the same layout.
That means all the data that needs to be written, for example invoice number, date, etc., is always in the same place in the document.
The only distinction that is necessary is to recognize the distributor.
What needs to be done is to extract or recognize that data and write it somewhere readable and searchable, either as a tag or in the metadata,
so that every document can be automatically sorted and filed by its data.
The next step would be that I can provide the documents to my tax adviser, who can then easily use the data for the booking process.
I look forward to any input I can get.
Daniel



RTT

  • Administrator
  • *****
  • Posts: 907
Re: Scan Documents and automatically write Metadata
« Reply #1 on: March 30, 2016, 12:06:50 AM »
First, the PDFs need to be OCRed, and the data you want to extract must show up correctly in the text-only PDF view (this is only to verify visually that the PDFE text extractor works with your OCRed PDFs). If that is the case, you can use the Task Automation Folders tool to set up a folder monitor on a work folder. Every time a new PDF is saved to the monitored folder, the monitor triggers a script that extracts the data, commits it to PDF metadata fields, and then moves the edited PDF to a processed-files folder (or folders). How the script extracts the data depends on the specifics of your PDFs, but generally this is easily accomplished with regular expressions and/or the scripts API Page.TextEx object, which provides font information and text position on the PDF page (check the "Page TextEx example" script in the My Scripts batch tool for more information about this object).
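To illustrate the regular-expression approach, here is a minimal sketch in Python. It is not the PDFE scripting API; it only shows the extraction logic on plain text that has already been pulled out of an OCRed PDF. The distributor name "ACME GmbH", the field names, and the patterns are all hypothetical and would have to be adapted to the real invoices.

```python
import re
from typing import Dict, Optional

# Hypothetical per-distributor patterns; each real layout needs its own expressions.
DISTRIBUTOR_PATTERNS = {
    "ACME GmbH": {
        "invoice_number": re.compile(r"Invoice\s+No\.?\s*:?\s*(\w[\w-]*)", re.IGNORECASE),
        "invoice_date": re.compile(r"Date\s*:?\s*(\d{2}\.\d{2}\.\d{4})", re.IGNORECASE),
    },
}


def detect_distributor(text: str) -> Optional[str]:
    """Return the first known distributor name found in the text, else None."""
    lowered = text.lower()
    for name in DISTRIBUTOR_PATTERNS:
        if name.lower() in lowered:
            return name
    return None


def extract_fields(text: str) -> Dict[str, str]:
    """Pick the pattern set for the detected distributor and apply it to the text."""
    distributor = detect_distributor(text)
    if distributor is None:
        return {}  # unknown layout: extract nothing rather than guess
    fields = {"distributor": distributor}
    for field, pattern in DISTRIBUTOR_PATTERNS[distributor].items():
        match = pattern.search(text)
        if match:
            fields[field] = match.group(1)
    return fields


# Made-up sample text standing in for the extracted page text.
sample = "ACME GmbH\nInvoice No: 2016-0042\nDate: 29.03.2016\nTotal: 119,00 EUR"
print(extract_fields(sample))
```

Keeping one pattern set per distributor means that recognizing the distributor is the only branching step, and a new supplier only requires adding one more entry to the table.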

In the Task Automation Folders tool, the script is called through the "rename file" task, by referencing the script name in the rename formula. The script must also return the new full path and file name, so that the rename file task can move the file to the processed files folder.
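The "return the new full path" step could look roughly like the following Python sketch. Again, this is not PDFE script code; the function name, the folder layout (one subfolder per distributor), and the file naming scheme are assumptions made purely for illustration.

```python
import re
from pathlib import PureWindowsPath


def processed_path(source: str, processed_root: str,
                   distributor: str, invoice_number: str) -> str:
    """Build the full destination path for a processed PDF (hypothetical scheme)."""

    def safe(part: str) -> str:
        # Replace characters that are not allowed in Windows file names.
        return re.sub(r'[<>:"/\\|?*]', "_", part).strip()

    suffix = PureWindowsPath(source).suffix  # keep the original extension
    name = f"{safe(distributor)}_{safe(invoice_number)}{suffix}"
    return str(PureWindowsPath(processed_root) / safe(distributor) / name)


# Hypothetical folders and values.
print(processed_path(r"C:\Scans\Inbox\scan0001.pdf", r"C:\Scans\Processed",
                     "ACME GmbH", "2016/0042"))
```

Sanitizing the extracted values before using them in a path matters here, because invoice numbers often contain characters such as "/" that are not valid in file names.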

If you want, attach some samples of these PDFs, already OCRed, to a forum reply (or send them directly to me), and let me know which data you want to extract and into which PDF metadata fields. I will then try to develop a script showing how all this can be accomplished.

After the data is extracted to metadata fields, you just need to provide your tax adviser with a .csv file created with the Export grid fields tool, along with the PDFs. Each .csv line will reference the PDF file name and its relevant extracted data fields.
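For a feel of what such a file might contain, the Python sketch below writes one line per PDF with the file name followed by the extracted fields. It does not reproduce the output format of the Export grid fields tool; the column names, the semicolon delimiter, and the sample row are assumptions.

```python
import csv
import io


def write_invoice_csv(rows, out) -> None:
    """Write one CSV line per PDF: file name first, then the extracted fields."""
    fieldnames = ["file_name", "distributor", "invoice_number", "invoice_date"]
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            delimiter=";", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Made-up row for illustration only.
rows = [
    {
        "file_name": "ACME GmbH_2016-0042.pdf",
        "distributor": "ACME GmbH",
        "invoice_number": "2016-0042",
        "invoice_date": "29.03.2016",
    },
]

buffer = io.StringIO()  # an open file object would be used in practice
write_invoice_csv(rows, buffer)
print(buffer.getvalue())
```

A semicolon delimiter is chosen here because amounts written with a decimal comma would otherwise collide with a comma separator; the adviser's booking software may expect something else.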