How to extract the text of a PDF book?

Status
Not open for further replies.
Joined
Oct 8, 2014
Messages
1,648
Location
SANTOS, BR
Hi, I have to translate a book written in ancient Latin, but I don't know how to extract its text to begin with. It is in PDF format. Could anybody point me in the right direction?
Thanks in advance.

Oz
 
Originally Posted By: Pontual
Hi, I have to translate a book written in ancient Latin, but I don't know how to extract its text to begin with. It is in PDF format. Could anybody point me in the right direction?
Thanks in advance.

Oz


Latin, eh? Are you translating the Emerald Tablet?

If the book is protected by DRM, you need to remove it with the Apprentice Alf tools first. Then you should be able to open the book in Calibre and convert it to a plain text file, which you can in turn save as .doc or .rtf in your word processor.

http://www.epubor.com/apprentice-alf-drm-removal-tools.html

http://calibre-ebook.com/download

https://www.libreoffice.org
 
It's a PDF, which is a page description format based on a subset of PostScript.
There isn't necessarily any text in there; it may just look like text.

There is likely no text to extract unless someone put it in there in the first place, where it sits alongside the page image you view.
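
One way to check whether a given PDF has any text to extract is to pull the text of each page and see how many pages come back non-empty. The following is only a rough sketch: it assumes the third-party pypdf package is installed, and the names `page_texts_from_pdf`, `has_text_layer`, `min_chars` and `min_fraction` are made up for illustration, as are the threshold values.

```python
def page_texts_from_pdf(path):
    """Return the extracted text of every page.

    Assumption: the third-party pypdf package is installed. It is imported
    lazily so that has_text_layer() below can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def has_text_layer(page_texts, min_chars=20, min_fraction=0.5):
    """Guess whether a PDF carries real text or only page images.

    A page counts as "has text" when at least min_chars non-blank
    characters were extracted from it. Both thresholds are arbitrary
    illustration values, not established cut-offs.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars
    )
    return pages_with_text / len(page_texts) >= min_fraction
```

Usage would be along the lines of `has_text_layer(page_texts_from_pdf("book.pdf"))`, where `book.pdf` is a placeholder file name. A result of `False` suggests the pages are scans and OCR will be needed.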

From Calibre's faq about pdf:

PDF documents are one of the worst formats to convert from. They are a fixed page size and text placement format. Meaning, it is very difficult to determine where one paragraph ends and another begins. calibre will try to unwrap paragraphs using a configurable, Line Un-Wrapping Factor. This is a scale used to determine the length at which a line should be unwrapped. Valid values are a decimal between 0 and 1. The default is 0.45, just under the median line length. Lower this value to include more text in the unwrapping. Increase to include less. You can adjust this value in the conversion settings under PDF Input.
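
To make the unwrapping idea more concrete, here is a simplified sketch of it. This is not calibre's actual implementation, only an illustration of the principle: take the line length found `factor` of the way through the sorted line lengths as a threshold, treat lines at least that long as wrapped, and treat shorter lines as paragraph ends. The function name `unwrap_lines` is invented for this example.

```python
def unwrap_lines(text, factor=0.45):
    """Join hard-wrapped lines back into paragraphs.

    Simplified illustration only. Lines at least as long as the threshold
    are assumed to continue on the next line; shorter lines and blank
    lines are assumed to end a paragraph.
    """
    lines = text.split("\n")
    lengths = sorted(len(line.strip()) for line in lines if line.strip())
    if not lengths:
        return text

    # Length located `factor` of the way through the sorted lengths.
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    threshold = lengths[index]

    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line always closes the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        if len(stripped) < threshold:
            # A short line is taken to be the last line of a paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

In this sketch, lowering `factor` lowers the threshold, so more lines count as wrapped and get joined, which matches the direction described in the FAQ.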

Also, they often have headers and footers as part of the document that will become included with the text. Use the Search and Replace panel to remove headers and footers to mitigate this issue. If the headers and footers are not removed from the text it can throw off the paragraph unwrapping. To learn how to use the header and footer removal options, read All about using regular expressions in calibre.
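
As an illustration of the kind of regular expression that removes headers and footers, here is a small Python sketch. The running header "LIBER PRIMUS" and the bare page-number lines are hypothetical examples; a real book would need its own pattern, and the name `strip_headers_footers` is made up here.

```python
import re

# Hypothetical pattern: a line containing only a page number, or only the
# running header "LIBER PRIMUS". [ \t]* is used instead of \s* so the
# match cannot spill across line breaks.
HEADER_FOOTER = re.compile(
    r"^[ \t]*(?:\d+|LIBER PRIMUS)[ \t]*$\n?",
    re.MULTILINE,
)


def strip_headers_footers(text, pattern=HEADER_FOOTER):
    """Delete every line that matches the header/footer pattern."""
    return pattern.sub("", text)
```

Numbers inside a normal line of text are left alone, because the pattern only matches when the number is the whole line.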

Some limitations of PDF input are:

Complex, multi-column, and image based documents are not supported.
Extraction of vector images and tables from within the document is also not supported.
Some PDFs use special glyphs to represent ll or ff or fi, etc. Conversion of these may or may not work depending on just how they are represented internally in the PDF.
Links and Tables of Contents are not supported
PDFs that use embedded non-unicode fonts to represent non-English characters will result in garbled output for those characters
Some PDFs are made up of photographs of the page with OCRed text behind them. In such cases calibre uses the OCRed text, which can be very different from what you see when you view the PDF file
PDFs that are used to display complex text, like right to left languages and math typesetting will not convert correctly
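
On the ligature point: when the extracted text does contain the Unicode ligature characters (rather than some font-specific code), they can be expanded afterwards. A minimal sketch, with a function name (`expand_ligatures`) invented for this example and a table covering only the common Latin-script ligatures:

```python
# Unicode "Alphabetic Presentation Forms" ligatures and their expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}

_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace single-character ligatures with their plain letters."""
    return text.translate(_TABLE)
```

This does nothing for PDFs whose fonts map ligatures to arbitrary private codes; in that case the output is garbled before this step ever sees it.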

To re-iterate PDF is a really, really bad format to use as input. If you absolutely must use PDF, then be prepared for an output ranging anywhere from decent to unusable, depending on the input PDF.
 
Two straightforward options:

1. Print it, scan it and OCR it.

2. If you have the full Adobe Acrobat product (not just Adobe PDF Reader) then you can OCR it inside Acrobat and save it as a .doc file.
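
If the PDF is already a set of page scans, the print-and-rescan step may be avoidable: the pages can be rendered to images and fed to an OCR engine directly. The sketch below only builds the command lines; it assumes the `pdftoppm` tool (from poppler-utils) and the `tesseract` OCR engine with its Latin language data (`lat`) are installed, which may not be the case on a given machine. The helper names and file names are placeholders.

```python
import subprocess


def pdf_to_images_cmd(pdf_path, out_prefix, dpi=300):
    """Command that renders each PDF page to a PNG file via pdftoppm."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def ocr_cmd(image_path, out_base, lang="lat"):
    """Command that OCRs one image with tesseract.

    tesseract writes its result to out_base + ".txt". The "lat" language
    code assumes the Latin traineddata file is installed.
    """
    return ["tesseract", image_path, out_base, "-l", lang]


def run(cmd):
    """Run a command and raise an error if it fails."""
    subprocess.run(cmd, check=True)
```

For example, `run(pdf_to_images_cmd("book.pdf", "page"))` followed by `run(ocr_cmd(...))` for each generated image. OCR quality on old Latin typefaces can vary a lot, so the output will still need proofreading.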
 
As long as the text in the pdf is actual text and not a series of scanned images of text, uploading it to Google Docs will allow you to convert the document to Google Docs, whereupon you'll have yourself some editable text.
 
The book was digitally scanned, so it's just a bunch of photos. I tried BRZED's way as well, but still couldn't get the text. I think I'll need to print it out.
 