Search for text in a PDF - double results

Asked 2020-04-16 08:10:59. Tags: pdf, pdf-generation

I have a question about searching for text in the PDF file attached here: PDF shared link (Google Drive). If I search for a text such as "1500", I see 4 occurrences, but there are only 2 occurrences, both on page 2. Likewise, a search for "musei" finds 2 occurrences, but this text appears only on page 1.

The search parses each page individually and seems to find the whole document's text on every single page, which is why I get doubled results.

Can anyone explain why this happens? Was this PDF file generated in a particular way, unlike other files where text search works fine?

Thanks a lot.

1 solution

That PDF is indeed special: each page contains the text of both pages. On the first page, the text from the second page lies to the right of the right page border, and on the second page, the text from the first page lies to the left of the left page border. Furthermore, the contents of the respective other page are additionally outside the clip area.

I enlarged the page boxes (media box, crop box, ...) of the first page to the right and of the second page to the left, and then selected all text (Ctrl-A) to reveal even the text outside the clip area.

[Screenshot: both pages with enlarged page boxes and all text selected]

For text extraction that returns only the text in the visible areas, you should restrict your text extraction routine to the crop box of the respective page.
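As a rough illustration of that restriction, here is a minimal Python sketch. It assumes your extraction library can return text chunks together with their bounding boxes, and that you can read the page's crop box as (x0, y0, x1, y1). The `TextChunk` type, the helper names, and the coordinates are all hypothetical and only mimic the layout described above (page 2's text sitting to the right of page 1's border); they are not taken from the actual PDF or from any particular library.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x0, y0, x1, y1) in page coordinates; hypothetical representation
Box = Tuple[float, float, float, float]


@dataclass
class TextChunk:
    """A piece of extracted text with its bounding box (hypothetical type)."""
    text: str
    bbox: Box


def inside(bbox: Box, crop: Box) -> bool:
    """Return True if bbox lies entirely within the crop box."""
    x0, y0, x1, y1 = bbox
    cx0, cy0, cx1, cy1 = crop
    return x0 >= cx0 and y0 >= cy0 and x1 <= cx1 and y1 <= cy1


def visible_text(chunks: List[TextChunk], crop: Box) -> List[TextChunk]:
    """Keep only the chunks that fall inside the page's crop box."""
    return [c for c in chunks if inside(c.bbox, crop)]


def count_hits(chunks: List[TextChunk], needle: str) -> int:
    """Count occurrences of needle across all chunks."""
    return sum(c.text.count(needle) for c in chunks)


# Made-up data for page 1: an A4-sized crop box, with the text that
# belongs to page 2 placed to the right of the right page border.
crop_box: Box = (0.0, 0.0, 595.0, 842.0)
page1 = [
    TextChunk("musei", (72.0, 700.0, 110.0, 712.0)),   # visible on page 1
    TextChunk("1500", (667.0, 700.0, 695.0, 712.0)),   # off-page, from page 2
    TextChunk("1500", (667.0, 650.0, 695.0, 662.0)),   # off-page, from page 2
]

unfiltered = count_hits(page1, "1500")
filtered = count_hits(visible_text(page1, crop_box), "1500")
print(unfiltered, filtered)
```

With this made-up data, the unfiltered search reports hits for "1500" on page 1 even though none of them are visible there, while the filtered search does not. Whether to require chunks to lie fully inside the crop box or merely to intersect it is a design choice; full containment is the stricter option used in this sketch.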