Ephesoft error when learning TIFF documents that have been converted from PDF

Asked 2015-01-24 05:10:03 · Tags: tesseract, ghostscript, imagemagick-convert, ephesoft

I am using the Ephesoft Community edition on a Windows Server 2003 AWS instance. I am having issues with Ephesoft reading certain TIFF documents. I have about 100 different TIFF documents and about 70% of them work. These TIFF documents were originally PDFs that we converted using the latest version of Ghostscript and cleaned up using the ImageMagick from Ephesoft. We are using the following options with Ghostscript:

-dNOPAUSE -r300 -sDEVICE=tiffg4 -dBATCH

With ImageMagick we are using the following option:

-compress group4
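For readers who want to reproduce the conversion, here is a minimal sketch of how the two sets of options above might fit into full command lines. The executable names (`gs`, `convert`), the file names, and the `-sOutputFile` argument are assumptions that are not in the original post; on Windows the Ghostscript console binary usually has a different name (for example `gswin32c`), and `convert` may need a full path to avoid clashing with the system tool of the same name.

```python
def ghostscript_args(pdf_path, tiff_path, executable="gs"):
    """Build a Ghostscript argv that renders a PDF to a 300 dpi Group 4 TIFF.

    The options -dNOPAUSE, -dBATCH, -r300 and -sDEVICE=tiffg4 come from the
    post; the executable name and -sOutputFile target are assumptions.
    """
    return [
        executable,
        "-dNOPAUSE",
        "-dBATCH",
        "-r300",
        "-sDEVICE=tiffg4",
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]


def imagemagick_args(src_tiff, dst_tiff, executable="convert"):
    """Build an ImageMagick argv that rewrites a TIFF with Group 4 compression.

    Only "-compress group4" comes from the post; everything else is assumed.
    """
    return [executable, src_tiff, "-compress", "group4", dst_tiff]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even where neither tool is installed.
    print(" ".join(ghostscript_args("input.pdf", "page.tif")))
    print(" ".join(imagemagick_args("page.tif", "page_clean.tif")))
```

Either list can be passed to `subprocess.run(args, check=True)` once the executable names match the local installation.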

When learning one of the TIFF files that isn't working, we get the following error in the log files:

Dropbox link to stack trace

And this is one of the TIFF documents we are trying to have Ephesoft learn:

Dropbox link to TIFF document

Is there something I can do with Ghostscript, ImageMagick, or any other software to fix this, or do I need to modify Ephesoft in some way?

1 Answer

I found the solution by doing some more research.

The problem didn't involve Ghostscript or ImageMagick. It involved Tesseract and the hOCR file it creates. When Tesseract creates the hOCR file, it resolves the value of Texas as Te>. The Community edition of Ephesoft cannot handle a special XML character like that and throws the error as a result.
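To illustrate the failure mode, here is a small sketch of how a stray markup character in recognized text can make an XML-style document unparseable. It assumes the hOCR is read by a strict XML parser, which the post implies but does not show, and it uses a bare `<` as the offending character because that is the one a strict parser is certain to reject; the exact character Tesseract emitted may have been altered when the post was rendered. The sample strings are hypothetical, not taken from the real hOCR file.

```python
import html
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a strict XML parser accepts the text, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical hOCR-like fragments: one clean word, one with a stray "<".
good = "<span class='ocrx_word'>Texas</span>"
bad = "<span class='ocrx_word'>Te<as</span>"

# Escaping the recognized text before embedding it restores well-formedness.
escaped = "<span class='ocrx_word'>" + html.escape("Te<as") + "</span>"

if __name__ == "__main__":
    for label, text in (("good", good), ("bad", bad), ("escaped", escaped)):
        print(label, is_well_formed(text))
```

A check like this can be run over generated hOCR files to find the pages that would fail before they are handed to the learning step.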

The solution was to set a Tesseract property that blacklists the <> symbols so that Tesseract would not include them or resolve to them. My PDFs seem to be working correctly now and I am able to process them.
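The answer does not say which property was set or where. One plausible candidate is Tesseract's `tessedit_char_blacklist` variable; the sketch below builds a command line that sets it through the `-c` option and requests hOCR output. The file names and the use of `-c` are assumptions: older Tesseract builds may need the variable in a config file instead, and how Ephesoft passes options to Tesseract is not covered here.

```python
def tesseract_hocr_args(image_path, output_base, blacklist="<>",
                        executable="tesseract"):
    """Build a Tesseract argv that writes hOCR and blacklists some characters.

    tessedit_char_blacklist is a Tesseract config variable; choosing it as
    "the property" from the answer is an assumption.
    """
    return [
        executable,
        image_path,
        output_base,
        "-c",
        f"tessedit_char_blacklist={blacklist}",
        "hocr",  # config name that switches the output format to hOCR
    ]


if __name__ == "__main__":
    # Passing the list straight to subprocess.run avoids shell quoting
    # problems with the "<" and ">" characters.
    print(tesseract_hocr_args("page.tif", "page"))
```

The equivalent config-file form is a single line, `tessedit_char_blacklist <>`, in a file whose name is given as the last argument on the Tesseract command line.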