
Get word list on Tess-Two

Original post 2013-05-16 00:23:13 7 1 android / tesseract

I'm developing an Android app that uses OCR. The tess-two sample works well and I can get the full OCR text, but I want to know how to get the individual words inside the boxes that Tesseract outputs. I can call getWords().getBoxRects() to get a list of the bounding boxes, and getWords() seems to do what I want, but it returns a Pixa object and I'm not sure how to obtain a word list (the words contained inside those boxes) from it.

The output I am looking for is a map with the following key-value pairs:

Word : Bounding box

Any tips would be great.

1 Answer

You can parse the hOCR output to obtain the words and their coordinates. See "Export HOCR output for tesseract OCR in android".

Alternatively, use the ResultIterator API, if tess-two supports it.
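
The hOCR route can be sketched in plain Java as follows. This is only a minimal illustration, not tess-two code: HocrWords, WordBox and parse are hypothetical names, and the regular expression assumes Tesseract-style markup in which each word is a span with class ocrx_word whose title attribute starts with "bbox x0 y0 x1 y1". The hOCR string itself is assumed to come from the OCR engine (tess-two builds that expose TessBaseAPI.getHOCRText(int page) can supply it). A list is returned rather than a map because the same word can occur more than once on a page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of tess-two): pulls words and boxes out of hOCR markup. */
class HocrWords {

    /** One recognized word with its bounding box in image pixel coordinates. */
    static final class WordBox {
        final String text;
        final int left, top, right, bottom;

        WordBox(String text, int left, int top, int right, int bottom) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        @Override
        public String toString() {
            return text + " : [" + left + ", " + top + ", " + right + ", " + bottom + "]";
        }
    }

    // Assumption: the class attribute precedes title, and either quote style may be used.
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)[^>]*>(.*?)</span>",
            Pattern.DOTALL);

    /** Returns every ocrx_word span found in the hOCR string, in document order. */
    static List<WordBox> parse(String hocr) {
        List<WordBox> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop inline markup such as <strong> or <em>, then decode basic entities.
            String text = unescape(m.group(5).replaceAll("<[^>]+>", "")).trim();
            if (text.isEmpty()) {
                continue; // skip spans that carry no recognized text
            }
            words.add(new WordBox(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }

    // Decodes only the entities that commonly appear in OCR output; &amp; goes last.
    private static String unescape(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&#39;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed format, not real engine output.
        String sample =
                "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 90'>Hello</span>\n"
              + "<span class='ocrx_word' id='word_1_2' title='bbox 109 92 185 116; x_wconf 88'><strong>world</strong></span>";
        for (WordBox w : parse(sample)) {
            System.out.println(w);
        }
    }
}
```

On Android the four integers map directly onto new Rect(left, top, right, bottom). A regular expression is enough for a sketch like this; for markup that deviates from the assumed layout, a proper HTML or XML parser would be more robust.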