Crawler reading a PDF

I am trying to create a crawler that can read a PDF and extract certain information from it (to save in a database).

However, I am unsure which method or tool to use.

My initial thought was to use PhantomJS, but after a lot of reading it does not seem to have that capability. If I wanted to use PhantomJS, I would have to download the PDF, convert it into an HTML page, and then crawl that page with PhantomJS, which seems like a tedious way to do something that should be achievable more directly.

So my question is: how can I read a PDF from an online source and gather these pieces of information?

If you are not limited in terms of programming language, consider using iText. It can easily extract all the text from a given PDF document. It also offers utility methods that search a file for regular-expression matches, returning the matching text together with its exact location (coordinates).
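iText's own regex-location utilities are not reproduced here. For the simpler case where only the matching text is needed and not its coordinates, a minimal sketch using plain java.util.regex on the extracted page text might look like the following. The class name, field names, and patterns are hypothetical and chosen purely for illustration; the sample string stands in for whatever PdfTextExtractor.getTextFromPage returns for your document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfFieldExtractor {
    // Hypothetical patterns: adjust them to the layout of your own documents.
    private static final Pattern INVOICE_NO =
            Pattern.compile("Invoice\\s+No\\.?:?\\s*(\\w+)");
    private static final Pattern TOTAL =
            Pattern.compile("Total:?\\s*([0-9]+(?:\\.[0-9]{2})?)");

    /** Returns the fields found in the page text, keyed by a field name. */
    static Map<String, String> extractFields(String pageText) {
        Map<String, String> fields = new LinkedHashMap<>();
        putIfFound(fields, "invoiceNo", INVOICE_NO, pageText);
        putIfFound(fields, "total", TOTAL, pageText);
        return fields;
    }

    // Stores the first capture group of the first match, if there is one.
    private static void putIfFound(Map<String, String> fields, String name,
                                   Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            fields.put(name, matcher.group(1));
        }
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF page.
        String sample = "ACME Ltd\nInvoice No: A1234\nTotal: 99.50\n";
        System.out.println(extractFields(sample)); // prints {invoiceNo=A1234, total=99.50}
    }
}
```

The resulting map can then be written to the database with whatever persistence layer the crawler already uses. Fields whose pattern does not match are simply absent from the map.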

iText is available for both C# and Java.

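import java.io.File;
import com.itextpdf.kernel.pdf.*; // PdfDocument, PdfReader (iText 7)
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
File inputFile = new File(""); // path to the PDF (left empty in the original)
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String content = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1)); // page numbers start at 1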
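The snippet above reads from a local File, while the question starts from a PDF hosted online. As a minimal sketch, assuming the PDF is reachable at a plain URL without authentication, the file can first be copied to a temporary location using only the Java standard library. The class name PdfDownloader and the method downloadToTempFile are hypothetical names chosen for illustration and are not part of iText.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class PdfDownloader {
    /** Copies the resource at the given URL into a temporary file and returns its path. */
    static Path downloadToTempFile(String url) throws IOException {
        Path target = Files.createTempFile("crawled-", ".pdf");
        try (InputStream in = URI.create(url).toURL().openStream()) {
            // The temp file already exists, so it has to be replaced explicitly.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // The URL is taken from the command line; nothing is fetched without one.
        if (args.length == 1) {
            Path pdf = downloadToTempFile(args[0]);
            System.out.println("Saved to " + pdf);
        }
    }
}
```

The returned path can be converted with toFile() and passed to PdfReader in place of the empty File path, so no intermediate HTML conversion is needed.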

Have a look at the website to learn more: http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction
