Tesseract

10 November 2008

Originally published on macresearch.org, around 2008. Reproduced from the author's archive; some links may no longer resolve.

Do-It-Yourself Optical Character Recognition (OCR)

An application that does Optical Character Recognition (OCR) — extracts text from raster images — will usually set you back a few bob, but if you are not averse to the terminal, you can have a sound OCR solution for nothing. It’s all thanks to an initiative of Google to index all of the information in existence. Some information takes the form of text embedded in images; humans can read it without any problem, but computers find it more of a challenge. This has led Google to start up the open-source project OCRopus.

OCRopus is not supported on Mac OS X at this point, but it is based on an older tool called Tesseract (link no longer available), and Tesseract is quite easy to compile and run. Here’s how:

Download the latest source code. At the time of writing, it’s version 2.03. (The original download links are no longer available.)
From the same download page, download the language data for each language you want to recognize. The latest pack for English is called “English language data for Tesseract (2.00 and up)”.
Unpack the source code bundle, open Terminal, and change into the root directory (‘tesseract-2.03’ at the time of writing).
Issue the standard UNIX build command sequence, and enter your administrator password when sudo prompts for it in the last step.
```
./configure
make
sudo make install
```
Unpack the language data, and move or copy each item in the tessdata directory into the directory /usr/local/share/tessdata/. Replace the files already in /usr/local/share/tessdata/ — which are just placeholders — with the ones you unpacked.
```
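# Assumes the language data was unpacked in ~/Downloads, creating ./tessdata
cd ~/Downloads
# Overwrite the placeholder files with the real language data
sudo cp tessdata/* /usr/local/share/tessdata/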
```

Your installation is now complete. Time to test it.
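
Before trying a real image, a quick sanity check can save some head-scratching. The sketch below is not part of Tesseract; `check_tesseract_install` is a hypothetical helper that confirms the binary is executable and that the data directory holds at least one non-empty language file. The paths are the ones used in this article. The `eng.` file prefix, and the idea that the placeholder files are empty, are assumptions.

```shell
#!/bin/sh
# Sketch: sanity-check a Tesseract install (hypothetical helper).
# Assumptions: English data files are named eng.*, and the placeholder
# files shipped with the source are empty.

check_tesseract_install() {
    bin=${1:-/usr/local/bin/tesseract}
    data=${2:-/usr/local/share/tessdata}
    lang=${3:-eng}

    if [ ! -x "$bin" ]; then
        echo "missing binary: $bin"
        return 1
    fi

    # Count the non-empty data files for this language
    found=0
    for f in "$data/$lang".*; do
        if [ -s "$f" ]; then
            found=$((found + 1))
        fi
    done

    if [ "$found" -eq 0 ]; then
        echo "no non-empty $lang.* files in $data"
        return 1
    fi
    echo "ok: $bin, $found $lang data file(s) in $data"
}
```

Calling `check_tesseract_install` with no arguments checks the default locations; pass a different binary, data directory, or language prefix as the first, second, and third arguments.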

Tesseract works only with TIFF images, so if you have another format, you need to use an application like Preview to convert it to TIFF. Once you have a TIFF image with some text in it — and it must have the extension .tif — you can use Tesseract to extract the text like this:

    /usr/local/bin/tesseract someimage.tif someimage_text

This should produce a text file called someimage_text.txt.
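
If you have a whole folder of scans, the same command can be wrapped in a loop. This is only a sketch: `ocr_dir` and the `TESSERACT` variable are names invented here, and the function simply repeats the command above for every .tif file it finds.

```shell
#!/bin/sh
# Sketch: OCR every .tif file in a directory (hypothetical helper).
# TESSERACT may be overridden; the default is the install path used above.
TESSERACT=${TESSERACT:-/usr/local/bin/tesseract}

ocr_dir() {
    dir=${1:-.}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue    # the glob matched nothing
        base=${img%.tif}             # someimage.tif -> someimage
        # tesseract appends .txt to the output name itself
        "$TESSERACT" "$img" "$base" || echo "failed: $img" >&2
    done
}
```

Running `ocr_dir ~/Scans` (a made-up folder name) should leave a .txt file next to each .tif image.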

My tests have shown that Tesseract does a reasonable job of extracting text, but only from images of fairly high resolution. If the resolution is too low, you end up with gobbledygook.
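
One rough way to spot a low-resolution image before feeding it to Tesseract is to read the DPI value stored in the file, using the `sips` tool that ships with Mac OS X. Treat the following as a sketch: the helper names are invented, the 300 dpi threshold is an illustrative guess rather than a figure from my tests, and the parsing assumes `sips -g dpiWidth` prints the file path followed by a line like `dpiWidth: 72.000`. The stored DPI is only metadata, so it is a hint, not a guarantee.

```shell
#!/bin/sh
# Sketch: warn about images whose stored DPI looks too low for OCR.
# MIN_DPI is an illustrative threshold, not a measured value.
MIN_DPI=${MIN_DPI:-300}

# Pull the integer DPI out of sips output. Assumed output shape:
#   /path/to/someimage.tif
#     dpiWidth: 72.000
parse_dpi() {
    awk '$1 == "dpiWidth:" { printf "%d\n", $2 }'
}

image_dpi() {
    sips -g dpiWidth "$1" | parse_dpi
}

warn_if_low_res() {
    dpi=$(image_dpi "$1")
    if [ -n "$dpi" ] && [ "$dpi" -lt "$MIN_DPI" ]; then
        echo "warning: $1 is $dpi dpi (< $MIN_DPI); OCR may be poor"
    fi
}
```

`warn_if_low_res someimage.tif` prints nothing when the image meets the threshold or when no DPI value could be read.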

Lastly, you can use Automator to make Tesseract a bit more user-friendly. I’ve created a workflow that prompts the user to select an image, converts it into TIFF format, runs Tesseract, and presents the text to the user in their default text editor.
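
For those who prefer to stay in the terminal, the same steps can be approximated in a short shell function. This is not the Automator workflow itself, just a sketch of the idea: `ocr_and_open` is a made-up name, `sips` stands in for Preview as the TIFF converter, and `open -t` hands the result to the default text editor. It assumes a Mac with Tesseract installed as described above.

```shell
#!/bin/sh
# Sketch: convert an image to TIFF, OCR it, and open the text (Mac only).
# TESSERACT may be overridden; the default is the install path used above.
TESSERACT=${TESSERACT:-/usr/local/bin/tesseract}

ocr_and_open() {
    src=$1
    # Work in a scratch directory so the original image is left alone
    work=$(mktemp -d "${TMPDIR:-/tmp}/ocr.XXXXXX") || return 1
    name=$(basename "$src")
    base="$work/${name%.*}"          # strip the original extension

    # 1. Convert to TIFF, using the .tif extension Tesseract expects
    sips -s format tiff "$src" --out "$base.tif" >/dev/null || return 1
    # 2. Run Tesseract, which writes $base.txt
    "$TESSERACT" "$base.tif" "$base" || return 1
    # 3. Show the result in the default text editor
    open -t "$base.txt"
}
```

To get a file-selection dialog like the workflow's, the path can come from AppleScript, for example `ocr_and_open "$(osascript -e 'POSIX path of (choose file with prompt "Select an image")')"`.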