Best way to export data from PDFs

Hi, I work at a newspaper and we are looking for a way to make archive material available. At the moment our pages come in PDF format, so we need a way to export text and images from the PDFs so that they can be added to a database. We've had a look at the News studio plugin for Adobe Acrobat from Iceni Technology, but I'm wondering if anyone knows of other options for exporting PDF data. Thanks.

There is pdftotext (part of xpdf). It will extract text from PDF files, provided the text is stored as text in the PDF and not as an image. You could probably use that.
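
For an archive with many pages, the pdftotext call can be wrapped in a small script. This is only a minimal sketch, assuming pdftotext is installed and on the PATH; the function names (`pdftotext_command`, `extract_text`) and the output directory layout are made up for illustration. By default pdftotext outputs text in reading order, which is usually what you want for multi-column pages; `-layout` keeps the physical layout instead.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the pdftotext argument list for one PDF (hypothetical helper)."""
    # -enc UTF-8 asks for UTF-8 output regardless of the tool's default encoding.
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout keeps the physical page layout instead of reading order.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, out_dir):
    """Run pdftotext on one PDF and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    txt_path = out_dir / (Path(pdf_path).stem + ".txt")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Something like `extract_text("page01.pdf", "text_out")` would then give you a string you can insert into the database. An empty result is a hint that the page is image-only and needs OCR.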

However, be advised that any solution for extracting text from PDFs will be limited, as PDFs are really meant for display only. At the very least, you will not have metadata like article date, author, etc.; also, if part of the text is in an image, you might lose that.

The better approach is probably to extract the raw data from the system that generates the PDFs, and archive that in a suitable format. Maybe more work, but better results.

If your PDFs already contain the text, then your job will be much easier: tools like pdftotext and pdftohtml will give you image and text output (see the Ubuntu package xpdf-utils).

On the other hand, if the text in your PDFs is image-based, then you'll have to look at OCR options. Fortunately, there are some good open source offerings. I have had some success using a combination of ImageMagick and Tesseract:

  1. First, convert the PDFs to TIFF with ImageMagick (Tesseract won't OCR PDFs)
  2. OCR the TIFF using Tesseract (you can also try gocr, also available in the Ubuntu repos)

The key was to make sure the TIFFs were of high enough quality. These ImageMagick settings worked well for me:

convert -depth 8 -density 500 -colorspace GRAY -resize 1600 input.pdf output.tif
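
The two steps can be chained in a script. The following is a rough sketch rather than a tested pipeline: it assumes both `convert` and `tesseract` are on the PATH, reuses the ImageMagick options from the command above unchanged, and relies on Tesseract writing its result to `<output base>.txt`. The function names and the working directory are made up for illustration.

```python
import subprocess
from pathlib import Path


def convert_command(pdf_path, tif_path):
    """ImageMagick call with the same options as the command above."""
    return ["convert", "-depth", "8", "-density", "500",
            "-colorspace", "GRAY", "-resize", "1600",
            str(pdf_path), str(tif_path)]


def tesseract_command(tif_path, out_base):
    """Tesseract takes an output base name and appends .txt itself."""
    return ["tesseract", str(tif_path), str(out_base)]


def ocr_pdf(pdf_path, work_dir):
    """Convert one PDF to TIFF, OCR it, and return the recognised text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    tif_path = work_dir / (stem + ".tif")
    out_base = work_dir / stem
    subprocess.run(convert_command(pdf_path, tif_path), check=True)
    subprocess.run(tesseract_command(tif_path, out_base), check=True)
    # Build the .txt path by appending, so stems containing dots stay intact.
    txt_path = Path(str(out_base) + ".txt")
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Multi-page PDFs produce a multi-page TIFF here; whether your Tesseract build reads those directly is worth checking on one sample file before running the whole archive.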

If you need to extract metadata from a PDF as well (Title, Location, Subject, Author, etc.), then pdftk is a useful tool.
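
As an illustration, pdftk's `dump_data` operation prints the document info as pairs of `InfoKey:` / `InfoValue:` lines, which are easy to turn into a dictionary. This is a minimal sketch assuming pdftk is on the PATH; `parse_dump_data` and `read_metadata` are hypothetical helper names, and only the Info fields are kept.

```python
import subprocess


def parse_dump_data(dump_text):
    """Turn pdftk dump_data output into a {key: value} dict of Info fields."""
    info = {}
    key = None
    for line in dump_text.splitlines():
        if line.startswith("InfoKey:"):
            # Remember the key until its matching InfoValue line arrives.
            key = line[len("InfoKey:"):].strip()
        elif line.startswith("InfoValue:") and key is not None:
            info[key] = line[len("InfoValue:"):].strip()
            key = None
    return info


def read_metadata(pdf_path):
    """Run pdftk on one PDF and return its Info fields as a dict."""
    result = subprocess.run(["pdftk", str(pdf_path), "dump_data"],
                            check=True, capture_output=True, text=True)
    return parse_dump_data(result.stdout)
```

Which fields are actually present depends on what the software that produced the PDF filled in, so expect many pages to come back with little more than a creator and a creation date.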
