Author Topic: Extract ISBN from books collection  (Read 4000 times)

Lothar

  • Guest
Extract ISBN from books collection
« on: August 24, 2010, 06:16:46 PM »
I'm trying your program to extract ISBNs from my book collection.

First problem:

In Batch tools -> Search & Extract I can only choose 1 page (too few, since the ISBN is usually somewhere between page 3 and page 5) or all pages (far too many).

Why not the first and last 10 pages?

Second problem:

The given regexp misses a lot of ISBNs, and I haven't managed to use the regexps that I found in other scripts:


RE_ISBN = re.compile("(?:ISBN[ -]*(?:|10|13)|International Standard Book Number)[:\s]*(?:|, PDF ed.|, print ed.|\(pbk\)|\(electronic\))[:\s]*([-0-9Xx]{10,25})",
                     re.MULTILINE)

// This is a combination of strict and relaxed versions of ISBN number format
var reISBN=/(ISBN[\:\=\s][\s]*(?=[-0-9xX ]{13})(?:[0-9]+[- ]){3}[0-9]*[xX0-9])|(ISBN[\:\=\s][ ]*\d{9,10}[\d|x])/g;


Your program is faster than any other tool I have tried, and with the right options it seems like the perfect tool for this job on a large collection.

RTT

  • Administrator
  • *****
  • Posts: 766
Re: Extract ISBN from books collection
« Reply #1 on: August 25, 2010, 01:31:14 AM »
Quote
In Batch tools -> Search & Extract I can only choose 1 page (too few, since the ISBN is usually somewhere between page 3 and page 5) or all pages (far too many).

Why not the first and last 10 pages?
That's a good point! I will try to address this in the next version.
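
Purely as an illustration of the request (this is not how the tool works internally, and the function name `first_and_last_pages` is made up for this example), picking the first and last N pages of a document could be sketched like this in Python:

```python
def first_and_last_pages(page_count, n=10):
    """Return sorted 1-based page numbers covering the first and last n pages.

    Hypothetical helper: overlapping ranges (short documents) are merged
    so that no page is listed twice.
    """
    if page_count <= 0 or n <= 0:
        return []
    head = range(1, min(n, page_count) + 1)
    tail = range(max(page_count - n + 1, 1), page_count + 1)
    return sorted(set(head) | set(tail))


# A 300-page book yields pages 1..10 and 291..300.
print(first_and_last_pages(300))
```

For a document shorter than 2 x N pages the two ranges overlap, so the result is simply every page once.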

Quote
The given regexp misses a lot of ISBNs, and I haven't managed to use the regexps that I found in other scripts:


RE_ISBN = re.compile("(?:ISBN[ -]*(?:|10|13)|International Standard Book Number)[:\s]*(?:|, PDF ed.|, print ed.|\(pbk\)|\(electronic\))[:\s]*([-0-9Xx]{10,25})",
                     re.MULTILINE)

// This is a combination of strict and relaxed versions of ISBN number format
var reISBN=/(ISBN[\:\=\s][\s]*(?=[-0-9xX ]{13})(?:[0-9]+[- ]){3}[0-9]*[xX0-9])|(ISBN[\:\=\s][ ]*\d{9,10}[\d|x])/g;
The regular expression component I'm using supports only a subset of Perl regular expression syntax, so many of the regular expressions found elsewhere are incompatible unless they are modified to match the supported syntax.
You can check the supported syntax in the attached file.

Try this one, which also covers the relaxed version:

(\d{3}[-]\d{1,5}[-]\d{1,7}[-]\d{1,6}[-][\dxX]|\d{1,5}[-]\d{1,7}[-]\d{1,6}[-][\dxX])|(ISBN[\:\=\s][ ]*\d{9,10}[\dxX])

In this case only the full match is important, so don't forget to set the capturing group 0 to "Extract".
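
To try the pattern outside the tool, here is a minimal Python sketch. Python's `re` module is an assumption on my side and is a different engine from the component the tool uses, so results may not be identical. The last character class is written as `[\dxX]`, because commas and `|` inside square brackets are matched literally. The sample text and the name `extract_isbns` are invented for the example.

```python
import re

# Python translation of the suggested pattern (assumption: Python's `re`
# engine, not the tool's own regular expression component).
ISBN_PATTERN = re.compile(
    r"(\d{3}[-]\d{1,5}[-]\d{1,7}[-]\d{1,6}[-][\dxX]"   # hyphenated ISBN-13
    r"|\d{1,5}[-]\d{1,7}[-]\d{1,6}[-][\dxX])"          # hyphenated ISBN-10
    r"|(ISBN[:=\s][ ]*\d{9,10}[\dxX])"                 # relaxed: "ISBN" + digits
)


def extract_isbns(text):
    """Return the full match (capturing group 0) of every hit in text."""
    return [m.group(0) for m in ISBN_PATTERN.finditer(text)]


# Made-up sample text, one ISBN per line.
sample = (
    "ISBN-13: 978-0-306-40615-7 (print ed.)\n"
    "ISBN 0-306-40615-2\n"
    "ISBN:0306406152\n"
)
print(extract_isbns(sample))
# ['978-0-306-40615-7', '0-306-40615-2', 'ISBN:0306406152']
```

Note that the relaxed branch keeps the "ISBN" prefix in group 0, while the hyphenated branches return the bare number.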

If the misses continue, please mail me one of these PDFs so I can take a look.
Some of the misses may also be related to the quality of the extracted text. You can use PDFView (text-only mode) or the text extractor tool to better understand what text the tool is processing.
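
One possible cause, and this is only an assumption until the extracted text has been inspected, is that the PDF uses typographic dashes or non-breaking spaces where the pattern expects a plain "-" or " ". If the text is exported and processed outside the tool, a normalization pass along these lines might help; the function name `normalize_for_isbn` is hypothetical:

```python
import re
import unicodedata

# Dash-like characters that may appear in extracted text instead of "-"
# (assumed list, extend it after looking at the real text).
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"
DASH_TABLE = str.maketrans({ch: "-" for ch in DASHES})


def normalize_for_isbn(text):
    """Map dash variants to "-", drop soft hyphens, collapse blank runs."""
    # NFKC folds compatibility characters, e.g. no-break space -> space.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(DASH_TABLE)
    text = text.replace("\u00ad", "")  # soft hyphen
    return re.sub(r"[ \t]+", " ", text)


raw = "ISBN\u00a0978\u20130\u2013306\u201340615\u20137"
print(normalize_for_isbn(raw))
# ISBN 978-0-306-40615-7
```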