
How can I tell Tesseract that my font has a particular size?

I have a collection of type-written image captions which look like this:

[image: typewritten text]

I know that the typewriter is consistent and monospace, with characters measuring 14x22px (as measured from the top of a capital letter to the bottom of a descender).

Tesseract is producing output like this:

[image: OCR result for the typewritten text]

The results are mostly good when Tesseract has detected the correct bounding boxes for the letters. But there are many runs of letters that get clumped together into a single box (e.g. "Ea", "tree", "fr" and "om" on the first line). These are always transcribed incorrectly and account for the majority of errors.

This is frustrating because I know a priori that all the characters are of a particular size. Is it possible to pass this knowledge on to the tesseract command line tool?

My command to generate the box file is:

tesseract foo.jpg foo batch.nochop makebox

If possible, I'd prefer to avoid training Tesseract on the font—I don't have any manually transcribed samples, so building a corpus of training data would require some effort.

I'm not sure that connected characters throw Tesseract off completely, as Noremac said.

Actually, I think it includes a step that chops joined characters whenever the result of recognizing a word is unsatisfactory, as explained in section 4.1 of "An Overview of the Tesseract OCR Engine".

I also think that once it detects fixed-pitch text, it should chop the text into characters automatically, even if the characters are connected (look at figure 2 of the same paper).
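To make the fixed-pitch idea concrete, here is a minimal Python sketch of splitting an over-wide bounding box into equal cells once the character pitch is known. This is not Tesseract's own code or algorithm; the function name `split_fixed_pitch`, the `(left, top, right, bottom)` box layout and the 14 px pitch are assumptions made for illustration (the question gives a 14 px character width, and the real pitch including the gap between characters may be larger).

```python
def split_fixed_pitch(box, pitch):
    """Split a (left, top, right, bottom) box into equal-width cells.

    `pitch` is the assumed horizontal advance of one character in pixels.
    Hypothetical helper for illustration, not part of Tesseract.
    """
    left, top, right, bottom = box
    width = right - left
    # Estimate how many characters fit in the box, at least one.
    n = max(1, round(width / pitch))
    cell = width / n
    return [
        (left + round(i * cell), top, left + round((i + 1) * cell), bottom)
        for i in range(n)
    ]


# A 28 px wide box at a 14 px pitch is treated as two characters.
cells = split_fixed_pitch((0, 0, 28, 22), 14)
print(cells)
```

A box that is only slightly wider than one character (for example 15 px) still rounds to a single cell, so well-segmented letters are left alone.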

I know that it's a little bit late to add this answer, but maybe it will help some future visitors!

The issue isn't the font size so much as the letters connecting. If you zoom in on the above images with a program that shows the actual pixels (rather than blurring them together), you can see that the characters in each clumped group are actually connected. Tesseract OCR is based entirely on connected components, so if characters are connected at all, it is thrown off completely. I see a couple of options:

  1. If possible, give it a higher resolution image where there is more separation between the characters
  2. Adjust the preprocessing to apply a stricter threshold.
    1. I noticed that the pixel connecting the E and the a in the first occurrence is lighter, so adjusting the threshold will remove that connection. However, this could affect more than you want, such as splitting characters where you don't expect it.

For changing the thresholding, consider this thread: https://groups.google.com/forum/#!topic/tesseract-ocr/JRwIz3xL45U
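As a rough illustration of the thresholding option, the Python sketch below binarizes a tiny made-up grayscale patch at two thresholds. The pixel values, the function names and the use of a column projection in place of true connected-component labelling are all assumptions chosen to keep the example dependency-free; a real pipeline would apply the same idea to the full image with an imaging library before handing it to Tesseract.

```python
def binarize(gray, threshold):
    """Return 1 for ink (pixel darker than threshold), 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def count_ink_runs(binary):
    """Count runs of adjacent columns that contain ink.

    A simplified stand-in for counting connected components.
    """
    runs = 0
    in_run = False
    for x in range(len(binary[0])):
        has_ink = any(row[x] for row in binary)
        if has_ink and not in_run:
            runs += 1
        in_run = has_ink
    return runs


# Two dark glyphs (30) joined by one lighter bridge pixel (140) on white (255).
patch = [
    [30, 30, 255, 30, 30],
    [30, 30, 140, 30, 30],
    [30, 30, 255, 30, 30],
]

print(count_ink_runs(binarize(patch, 160)))  # bridge counted as ink
print(count_ink_runs(binarize(patch, 100)))  # bridge dropped
```

With the looser threshold the bridge pixel joins the two glyphs into one blob; with the stricter one they separate. The same stricter threshold would also erase any faint strokes inside a letter, which is the trade-off mentioned above.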


 