Crawler reading PDF

Question

I am trying to create a crawler that can read a PDF and extract certain information from it (to save in a database).

However, I am unsure which method or tool to use.

My initial thought was to use PhantomJS, but after a lot of reading it does not seem to have that capability. If I wanted to use PhantomJS, I would have to download the PDF, convert it into an HTML page, and then crawl that page with PhantomJS. That seems like a tedious route for something that should be achievable more directly.

So my question is: how can I read a PDF from an online source and gather these pieces of information?

Answer 1

If you are not limited in terms of programming language, consider using iText. It can easily extract all the text from a given PDF document. It also offers utility methods to look for regular expressions within a file, giving you back the exact location (coordinates) and the matching text.

iText is available for both C# and Java.

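// Requires iText 7: com.itextpdf.kernel.pdf.{PdfDocument, PdfReader} and com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor
File inputFile = new File(""); // path to the PDF on disk
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String content = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1)); // pages are numbered from 1
pdfDocument.close(); // release the underlying file handle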
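Once the page text is in a String such as content, the pieces of information can be pulled out with ordinary regular expressions before they go to the database. The following is only a minimal sketch using the Java standard library: the class name FieldExtractor, the field names, and both patterns (an invoice number and a total) are hypothetical, since the question does not say which fields are needed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldExtractor {
    // Hypothetical patterns: adjust them to the layout of your own PDFs.
    private static final Pattern INVOICE_NO =
            Pattern.compile("Invoice\\s+No\\.?:\\s*(\\S+)");
    private static final Pattern TOTAL =
            Pattern.compile("Total:\\s*([0-9]+(?:\\.[0-9]{2})?)");

    // Returns the fields that were found; missing fields are simply absent.
    static Map<String, String> extract(String pageText) {
        Map<String, String> fields = new LinkedHashMap<>();
        putIfFound(fields, "invoiceNo", INVOICE_NO, pageText);
        putIfFound(fields, "total", TOTAL, pageText);
        return fields;
    }

    // Stores the first capture group of the first match, if there is one.
    private static void putIfFound(Map<String, String> fields, String name,
                                   Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            fields.put(name, matcher.group(1));
        }
    }
}
```

Calling FieldExtractor.extract(content) then gives a map that can be written to a database row. Text extracted from a PDF does not always preserve the visual reading order, so the patterns may need loosening for real documents.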

Have a look at the website to learn more: http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction
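The snippet above reads from a local File, so a PDF hosted online has to be fetched first. One possible way to do that with only the Java standard library is sketched below. The class name PdfDownloader and its download method are hypothetical helpers, not part of iText, and the sketch assumes the PDF is reachable with a plain GET (no login, cookies, or JavaScript).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class PdfDownloader {
    // Copies the bytes behind `source` into `target`, overwriting any existing file.
    static Path download(URI source, Path target) throws IOException {
        try (InputStream in = source.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

The returned path can be turned into the File that PdfReader expects, for example PdfDownloader.download(URI.create("https://example.com/report.pdf"), Path.of("report.pdf")).toFile(), where the URL is a placeholder.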

Answer 1
1 2017-09-05 11:54:17
