
Ephesoft error when learning TIFF documents that have been converted from PDF

I am using the Ephesoft Community Edition on Windows Server 2003, running on an AWS instance. I am having issues with Ephesoft reading certain TIFF documents. I have about 100 different TIFF documents, and about 70% of them work. These TIFF documents were originally PDFs that we converted using the latest version of Ghostscript and cleaned up using ImageMagick from Ephesoft. We are using the following options with Ghostscript:

-dNOPAUSE -r300 -sDEVICE=tiffg4 -dBATCH

With ImageMagick we are using the following option:

-compress group4
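For reference, the two option fragments above could be assembled into complete commands roughly as follows. This is only a sketch: the executable names (`gs`, `convert`), the file names, and the `-sOutputFile` argument are assumptions, since the question shows only the options themselves. On Windows the Ghostscript console binary is commonly named `gswin32c` or `gswin64c` instead.

```python
import subprocess


def ghostscript_args(pdf_path, tiff_path, gs_exe="gs"):
    """Build a Ghostscript command from the options quoted in the question."""
    return [
        gs_exe,
        "-dNOPAUSE",
        "-dBATCH",
        "-r300",                      # 300 dpi, as in the question
        "-sDEVICE=tiffg4",            # Group 4 fax-compressed TIFF output
        f"-sOutputFile={tiff_path}",  # assumed; not shown in the question
        pdf_path,
    ]


def imagemagick_args(src_path, dst_path, convert_exe="convert"):
    """Build an ImageMagick command that re-saves a TIFF with Group 4 compression."""
    return [convert_exe, src_path, "-compress", "group4", dst_path]


def convert_pdf(pdf_path, raw_tiff, cleaned_tiff):
    """Run both steps in sequence; raises CalledProcessError if either tool fails."""
    subprocess.run(ghostscript_args(pdf_path, raw_tiff), check=True)
    subprocess.run(imagemagick_args(raw_tiff, cleaned_tiff), check=True)
```

Keeping the argument lists in separate functions makes it easy to print and inspect the exact command line before running it against a batch of files.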

When learning one of the TIFF files that isn't working, we get the following error in the log files:

Dropbox link to stack trace

And this is one of the TIFF documents we are trying to have Ephesoft learn:

Dropbox link to TIFF document

Is there something I can do with Ghostscript, ImageMagick, or any other software to fix this, or do I need to modify Ephesoft in some way?

I found the solution by doing some more research.

The problem didn't involve Ghostscript or ImageMagick. It involved Tesseract and the creation of the hOCR file. When Tesseract creates the hOCR file, it resolves the value of Texas as Te>. The Community Edition of Ephesoft cannot handle a special XML character like that and throws the error as a result.
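One way to find which documents are affected is to check whether each generated hOCR file parses as XML. The sketch below uses Python's standard XML parser, not Ephesoft's own, so it only approximates what Ephesoft would reject; the function name and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def hocr_parse_error(hocr_text):
    """Return None if the text is well-formed XML, else the parser's error message."""
    try:
        ET.fromstring(hocr_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Illustrative samples only; real hOCR output contains much more markup.
good = "<html><body><span class='ocrx_word'>Texas</span></body></html>"
bad = "<html><body><span class='ocrx_word'>Te<</span></body></html>"

for label, sample in (("good", good), ("bad", bad)):
    error = hocr_parse_error(sample)
    print(label, "parses cleanly" if error is None else f"fails: {error}")
```

Note that a stray `<` in the recognized text always breaks well-formedness, while a lone `>` is tolerated by many XML parsers, so a file can pass this check and still be rejected by a stricter consumer.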

The solution was to set a Tesseract property that blacklists the < and > symbols, so that Tesseract would neither include them nor resolve any text to them. My PDFs seem to be working correctly now, and I am able to process them.
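The answer does not say where the property was set. In Tesseract the relevant parameter is `tessedit_char_blacklist`. A minimal sketch of passing it when Tesseract is invoked directly, assuming a build that supports the `-c` command-line option, is shown below; the executable name and file names are placeholders, and this is not necessarily how Ephesoft itself calls Tesseract.

```python
def tesseract_hocr_args(image_path, output_base, blacklist="<>",
                        tesseract_exe="tesseract"):
    """Build a Tesseract command that emits hOCR and never outputs blacklisted characters."""
    return [
        tesseract_exe,
        image_path,
        output_base,
        "-c", f"tessedit_char_blacklist={blacklist}",
        "hocr",  # standard Tesseract config name that selects hOCR output
    ]


# Print the command for inspection; pass the list to subprocess.run to execute it.
print(" ".join(tesseract_hocr_args("page.tif", "page")))
```

The same setting can also be placed in a Tesseract config file as a single line, `tessedit_char_blacklist <>`, which may be the more practical route when another application launches Tesseract for you.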

