PDF to text OCR command line

1/18/2024

The Text Extractor approach seems the most promising to me. I also had the impression that it failed (and hung) on a 9-page document, but maybe I read that wrong. I couldn’t find a way to limit the number of pages or exclude PDF files for Obsidian-OCR, but this is something I could look into too.

I know that at least some of them do not already include the text, though I would have guessed that number to be lower. That is something I could try to improve! On the other hand, this means I might just need to improve or alter the embedded text in the 60% of PDFs it is missing. As you say, Text Extractor runs OCR only on images and extracts the embedded text from PDFs where possible. I should read the plugin description more carefully.

But this fits quite well with the whole problem I’m seeing. I wasn’t aware that macOS does its own OCR run and was quite convinced it only used existing embedded data.

I just ditched a full Notion-Obsidian sync setup that I wrote and used for a year to move completely to Obsidian, and I am quite amazed by the community so far! Wow, thank you so much for your fast, kind and elaborate answer!

gs: The command below should convert a multipage PDF to individual TIFF files.

The first version of Text Extractor used PDFjs and worked perfectly… unless you had more than a dozen files, at which point it hard-crashed Obsidian.
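The Ghostscript command referred to above did not survive in this post. As a rough sketch only, and not the original command, the snippet below builds one plausible `gs` invocation that writes one TIFF per page. It assumes Ghostscript is installed and on the PATH as `gs`. The `tiffg4` output device, the 300 dpi resolution, the `page_%03d.tiff` naming and the helper names `build_gs_tiff_command` and `pdf_to_tiffs` are all my assumptions.

```python
import subprocess
from pathlib import Path


def build_gs_tiff_command(pdf_path, out_dir, dpi=300):
    """Return an argv list asking Ghostscript for one TIFF per page.

    Hypothetical helper: device, resolution and file naming are assumptions,
    not the command from the original thread.
    """
    # %03d is expanded by Ghostscript to the page number (001, 002, ...).
    out_pattern = str(Path(out_dir) / "page_%03d.tiff")
    return [
        "gs",
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the input has been processed
        "-dSAFER",            # restrict file access from within the PDF
        "-sDEVICE=tiffg4",    # black-and-white TIFF, usually fine for OCR
        f"-r{dpi}",           # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def pdf_to_tiffs(pdf_path, out_dir, dpi=300):
    """Run Ghostscript; raises CalledProcessError if gs reports a failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_gs_tiff_command(pdf_path, out_dir, dpi), check=True)
```

Each resulting TIFF could then be handed to an OCR tool, for example the Tesseract command line (`tesseract page_001.tiff page_001`), if that is installed.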

While there are ways to do this, the solutions are often flawed: they break often, don’t scale well, or take too much time or too many resources. What makes the whole thing even more complex is that Obsidian is essentially a web app, so we’re limited to web technologies (JS and Wasm).

OCR with TesseractJS works pretty well, but converting a 150-page PDF into images to OCR it is… not ideal. The Text Extractor plugin does not use OCR for PDFs, but tries to extract the text directly from the file.

You mention that macOS has already OCRed your files through its “Live Text” (iirc) feature, but unfortunately there is no API for third-party apps to use this data.

Short answer: no, or it would already be built into Obsidian, or Text Extractor wouldn’t fail on so many files.
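The split described above (use the embedded text where it exists, and fall back to OCR only where it doesn’t) could be sketched roughly as follows. This is not how Text Extractor is implemented. The function name `pages_needing_ocr` and the `min_chars` threshold are hypothetical, and the per-page text is assumed to come from whatever PDF text extractor you already use.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages whose embedded text looks missing.

    page_texts: one string (or None) per page, as returned by some
    PDF text extractor. min_chars is an arbitrary, assumed threshold.
    """
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Ignore whitespace so a page of blank lines counts as empty.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            needs_ocr.append(index)
    return needs_ocr


if __name__ == "__main__":
    sample = ["Invoice 2024-001, total due 120.00 EUR", "", "  \n ", None]
    print(pages_needing_ocr(sample))
```

Only the flagged pages would then need to be rendered to images and OCRed, which avoids rasterising a whole 150-page document when most of it already carries text.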

Shouldn’t it be rather easy to search for PDFs using full text search?
