
Parsing a PDF file with a clickable contents page

Original post: 2012-12-30 20:40:02 4 2 c#/ c#-4.0/ pdf/ pdf-parsing

Let's say we have a PDF file that has a clickable contents page (I am talking about chapters and subchapters). How can such a file be parsed in C#, and how can an application tell whether the PDF it is reading has chapters, a contents page, and so on?

Here is a link to a PDF without a clickable table of contents: https://docs.google.com/open?id=0B1EbI-EMJxmkODE1Mm5WbFpEdXc

I could not find a PDF with a clickable table of contents, but I found a guide on how to create one here: http://everythingyoumightneed.blogspot.com/2013/01/how-to-create-pdf-with-clickable-links.html

So my question is: how can an app tell which is which, and how can the one with clickable links be parsed?

2 answers

Your problem is not dissimilar to trying to figure out where paragraphs and columns are in PDF files; PDF doesn't typically label a table of contents page as such. So even with a PDF library (such as iTextSharp, pointed out by mkl), this will not be a trivial task.

With such a library, you will be able to see the pages in the PDF file and the text on those pages. However, if this is a book, for example, the table of contents may be the first, second, third or xth page in the PDF file because of various other pages appearing in front of it (cover, second cover, copyright, tributes, you name it...).

So an algorithm to discover whether there is a table of contents would have to be able to find it somewhere in the first x pages of the PDF file. As there are no standard tags highlighting the text in the table of contents, this would have to be done by analysing the format of the text on each of those pages.
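As a rough illustration of that kind of format analysis, here is a minimal sketch. It assumes the page text has already been extracted by some PDF library, and that table of contents entries look like a title followed by dot leaders or wide spacing and a trailing page number. The class name `TocHeuristic`, the regular expression, and the thresholds `maxPages` and `minRatio` are all hypothetical choices for this example, not anything defined by the PDF format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TocHeuristic
{
    // Assumed entry shape: "1.2 Getting started ........ 17" or "Chapter 3    45",
    // i.e. some title text, then dot leaders or 2+ spaces, then a page number.
    private static readonly Regex TocLine = new Regex(
        @"^\s*\S.*?(\.{3,}|\s{2,})\s*\d{1,4}\s*$",
        RegexOptions.Compiled);

    public static bool LooksLikeTocLine(string line)
    {
        return TocLine.IsMatch(line);
    }

    // pageTexts[i] holds the extracted text of page i + 1.
    // Returns the 1-based number of the first page (within the first maxPages pages)
    // on which at least minRatio of the non-empty lines look like TOC entries,
    // or 0 if no such page is found.
    public static int FindTocPage(IList<string> pageTexts, int maxPages = 20, double minRatio = 0.5)
    {
        int limit = Math.Min(maxPages, pageTexts.Count);
        for (int i = 0; i < limit; i++)
        {
            List<string> lines = pageTexts[i]
                .Split('\n')
                .Select(l => l.Trim())
                .Where(l => l.Length > 0)
                .ToList();

            if (lines.Count < 3) continue; // too little text to judge

            int hits = lines.Count(l => LooksLikeTocLine(l));
            if (hits >= minRatio * lines.Count) return i + 1;
        }
        return 0;
    }
}
```

A heuristic like this will misfire on index pages, lists of figures, or tables of numbers, so it is best treated as one signal among several.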

There are two things that could help (if they are available):

1) In many PDF files the items in a table of contents are, as you say, clickable. So you could look through the PDF file and try to find the first page that contains a lot of hyperlinked items.

2) In many PDF files the table of contents is mirrored in the bookmarks. So you could also examine the bookmark structure and see whether you can use it to figure out how many chapters there are in the book.

Keep in mind that both of these features are optional, and that they are not standardized even when they are present.
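Both hints can be probed with iTextSharp. The following is only a sketch, assuming iTextSharp 5.x (where `SimpleBookmark.GetBookmark` returns a list of dictionaries, or null when the file has no bookmarks). The class name `PdfTocProbe` and the thresholds (first 20 pages, at least 5 links) are arbitrary choices for this example and would need tuning on real files.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class PdfTocProbe
{
    // Counts the link annotations on a page (pageNumber is 1-based).
    public static int CountLinks(PdfReader reader, int pageNumber)
    {
        PdfDictionary page = reader.GetPageN(pageNumber);
        PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
        if (annots == null) return 0; // page has no annotations at all

        int count = 0;
        for (int i = 0; i < annots.Size; i++)
        {
            PdfDictionary annot = annots.GetAsDict(i);
            if (annot != null && PdfName.LINK.Equals(annot.GetAsName(PdfName.SUBTYPE)))
                count++;
        }
        return count;
    }

    // Returns the first page among the first maxPages pages that has at least
    // minLinks link annotations, or 0 if there is none.
    public static int FirstLinkHeavyPage(PdfReader reader, int maxPages, int minLinks)
    {
        int limit = Math.Min(maxPages, reader.NumberOfPages);
        for (int p = 1; p <= limit; p++)
        {
            if (CountLinks(reader, p) >= minLinks) return p;
        }
        return 0;
    }

    public static void Main(string[] args)
    {
        PdfReader reader = new PdfReader(args[0]); // path to the PDF file
        try
        {
            // Hint 2: bookmarks (the document outline), if the file has any.
            IList<Dictionary<string, object>> bookmarks = SimpleBookmark.GetBookmark(reader);
            Console.WriteLine("Top-level bookmarks: " + (bookmarks == null ? 0 : bookmarks.Count));
            if (bookmarks != null)
            {
                foreach (Dictionary<string, object> bookmark in bookmarks)
                {
                    object title;
                    if (bookmark.TryGetValue("Title", out title))
                        Console.WriteLine("  " + title);
                }
            }

            // Hint 1: first page with many clickable items.
            Console.WriteLine("First link-heavy page: " + FirstLinkHeavyPage(reader, 20, 5));
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Note that this counts every link annotation, including links to external URLs, so a page full of web references could be mistaken for a contents page. Checking that the links point to destinations inside the document would make the test stricter.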

Since PDF is a binary format, you'll have to use a PDF library such as PDFlib in order to read PDF files.

PDFlib

You may also want to check out this CodeProject article for some examples: Converting PDF to Text in C#