
Extract header/footer from PDF (programmatically)

Original post 2013-10-15 09:15:48 4 1 python/ pdf/ document

Is it possible to extract the header and/or footer from a PDF document?

Having tried a few options (including PDFMiner, the Ruby gem pdf-extract, and studying the PDF format specs), I'm starting to suspect that the header/footer information is not available at all.

(I would like to do this from Python, if possible, but any other alternative is viable.)

1 answer

Page headers and footers are not (at least not necessarily) located in some content part separate from the rest of the page content. Thus, in general there is no way to reliably extract headers and footers from PDFs.

It is possible, though, to try heuristics which look at the whole PDF contents and guess which parts are headers and/or footers.

If the PDFs you want to analyze are fairly homogeneous, e.g. all produced by the same publisher and looking alike, this might be feasible. The more diverse your source PDFs are, though, the more complex your heuristics will likely become and the less accurate the results will be.
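
To make the heuristic idea a bit more concrete, here is a minimal Python sketch of one possible approach, not a definitive implementation. It assumes you have already pulled text lines with their vertical positions out of each page using some extraction library (PDFMiner, for instance); that extraction step is not shown. The function names, the 10% margin and the 60% repetition threshold are all assumptions chosen for illustration. The guess is simply: text that sits near the top or bottom edge and repeats on most pages (ignoring digits, so page numbers still match) is probably a header or footer.

```python
import re
from collections import Counter


def normalize(text):
    """Collapse whitespace and replace digit runs with '#',
    so that 'Page 3' and 'Page 12' compare as equal."""
    return re.sub(r"\d+", "#", " ".join(text.split())).lower()


def guess_headers_footers(pages, page_height, margin=0.1, min_share=0.6):
    """Guess repeated header/footer strings.

    pages: list of pages, each a list of (y, text) tuples, where y is the
           distance from the bottom edge of the page (PDF-style coordinates).
           How these tuples are obtained is up to the caller (assumption).
    page_height: page height in the same units as y.
    margin: fraction of the page height treated as the top/bottom band.
    min_share: fraction of pages a string must appear on to count.

    Returns (headers, footers) as sets of normalized strings.
    """
    top = Counter()
    bottom = Counter()
    for lines in pages:
        seen_top, seen_bottom = set(), set()
        for y, text in lines:
            key = normalize(text)
            if not key:
                continue
            if y >= page_height * (1 - margin):
                seen_top.add(key)
            elif y <= page_height * margin:
                seen_bottom.add(key)
        # Count each string at most once per page.
        top.update(seen_top)
        bottom.update(seen_bottom)

    # Require at least two pages, otherwise everything would "repeat".
    threshold = max(2, min_share * len(pages))
    headers = {k for k, n in top.items() if n >= threshold}
    footers = {k for k, n in bottom.items() if n >= threshold}
    return headers, footers


if __name__ == "__main__":
    # Made-up sample data for a 792-unit-high page (US Letter in points).
    sample_pages = [
        [(770, "Annual Report 2013"), (400, "Body text one"), (30, "Page 1")],
        [(770, "Annual Report 2013"), (400, "Other body"), (30, "Page 2")],
        [(770, "Annual Report 2013"), (420, "More body"), (30, "Page 3")],
    ]
    print(guess_headers_footers(sample_pages, page_height=792))
```

This kind of rule tends to work only on fairly uniform documents. Alternating left/right page headers, chapter-specific running titles, or pages with different sizes would each need extra handling, which is where the heuristics start to grow.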