
Automatically extract text from many PDF files

Original 2013-04-22 17:20:52 3 3 java/ python/ pdf/ text

I have about 10,000 PDF files (conference papers), and I need to extract the text of certain sections (such as the experimental section) from these papers and save it to a file. Does anyone know of a Java or Python tool that can help me do this?

Thanks in advance

Ayush

3 answers

Did you research your question before posting? I just googled and found this Apache project: http://pdfbox.apache.org/

For Java: have a look at iText

For Python I would use PDFMiner
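To make the Python suggestion concrete, here is a rough sketch of a batch run. It assumes the maintained pdfminer.six fork of PDFMiner and its `pdfminer.high_level.extract_text` helper. It also assumes the experimental section starts at a heading such as "4 Experiments" on its own line and ends at the next top-level numbered heading. The heading patterns, the function names, and the folder layout are all made up for illustration, and real papers will need the patterns tuned.

```python
import re
from pathlib import Path

# Hypothetical heading pattern: an optional section number followed by a title
# such as "4 Experiments" or "5. Experimental Results" on a line of its own.
SECTION_START = re.compile(
    r"^\s*(?:\d+\.?\s+)?(?:experiments?|experimental\s+\w+|evaluation)\s*$",
    re.IGNORECASE | re.MULTILINE,
)
# Any top-level numbered heading ("5 Conclusion"); subsections such as
# "4.1 Setup" do not match, so they stay inside the extracted section.
NEXT_HEADING = re.compile(r"^\s*\d+\.?\s+[A-Z][^\n]*$", re.MULTILINE)


def extract_section(text):
    """Return the text between the experimental heading and the next
    top-level numbered heading, or None if no such heading is found."""
    start = SECTION_START.search(text)
    if start is None:
        return None
    following = NEXT_HEADING.search(text, start.end())
    end = following.start() if following else len(text)
    return text[start.end():end].strip()


def pdf_to_text(pdf_path):
    """Extract all text from one PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so the heading logic can be tried without the library.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def process_folder(pdf_dir, out_dir):
    """Write the experimental section of every PDF in pdf_dir to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            section = extract_section(pdf_to_text(pdf_path))
        except Exception as exc:  # one corrupt PDF should not stop the batch
            print(f"skipped {pdf_path.name}: {exc}")
            continue
        if section:
            out_file = out_dir / f"{pdf_path.stem}.txt"
            out_file.write_text(section, encoding="utf-8")
```

Calling `process_folder("papers", "sections")` (both folder names are placeholders) would write one `.txt` file per paper in which a matching heading was found. Two-column layouts and headings that do not sit on their own line are likely to defeat this simple heuristic, so it is worth checking the output on a small sample before running all 10,000 files.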

Since these are academic papers, you should also take a serious look at lapdftext

LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance where needed). The system is open-source and provides a simple baseline function for extracting text from primary research articles using rules that developers can customize.