Tika duplicates text when used with Tesseract on OCR PDF

Original post: 2017-02-20 15:41:05 | Tags: pdf, ocr, tesseract, apache-tika

I have a scanned PDF that has already been OCRed, so it now has two layers: the scanned image and a text layer on top of it.

If I use Tika with integrated Tesseract to extract text from that PDF, I get duplicate text: one copy comes from the existing OCR text layer and the other from Tesseract OCRing the image.

In this case I need only the existing OCR text.

I can't just disable Tesseract, because there may be PDFs that contain only images, or PDFs that contain both text and images.

Tesseract is integrated into Tika as described in the question "Apache Tika extract scanned PDF files".

Is there any way to tell Tika not to use Tesseract for images inside a PDF that already have OCR text over them?

1 Answer

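We had a similar problem. We kept it to a simple if/else condition: pass the PDF to the default PDF parser first, and if the result comes back empty, make a second call on the PDF with the Tesseract option enabled.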
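The if/else described above could look roughly like the sketch below. It is not code from the answer: `OcrFallback`, `plainExtractor`, and `ocrExtractor` are hypothetical names, and the two extractors are stand-ins for whatever performs the actual parsing. With Tika, the first would presumably be a parse with OCR left off, and the second a parse with Tesseract enabled (for example, the setup from the linked question, which I assume uses `PDFParserConfig.setExtractInlineImages(true)` together with a `TesseractOCRConfig` in the `ParseContext`).

```java
import java.util.function.Function;

/**
 * Sketch of the "text layer first, OCR only if empty" fallback.
 * Both extractors are hypothetical stand-ins, not Tika APIs.
 */
final class OcrFallback {

    private OcrFallback() {}

    /**
     * Runs plainExtractor first and calls ocrExtractor only when the
     * first pass yields null or whitespace-only text.
     */
    static <T> String extract(T pdf,
                              Function<T, String> plainExtractor,
                              Function<T, String> ocrExtractor) {
        String text = plainExtractor.apply(pdf);
        if (text != null && !text.trim().isEmpty()) {
            return text; // existing text layer found, so OCR is skipped
        }
        String ocrText = ocrExtractor.apply(pdf);
        return ocrText == null ? "" : ocrText;
    }

    public static void main(String[] args) {
        // Stub extractors keyed on a fake file name, for demonstration only.
        Function<String, String> plain =
                name -> name.startsWith("ocred") ? "text layer" : "  ";
        Function<String, String> ocr = name -> "tesseract output";

        System.out.println(extract("ocred.pdf", plain, ocr));
        System.out.println(extract("image-only.pdf", plain, ocr));
    }
}
```

Note that this check is made per document, not per page or per image. A PDF that mixes pages with a text layer and image-only pages would return only the text layer, and the image-only pages would never reach Tesseract.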
