
Search for text and get its position in a PDF with Java

How can I search for text and get its position in a PDF with Java? I tried Apache PDFBox and PDF Clown, but neither works whenever the text wraps onto the next line or starts a new paragraph. I want to get the same result as in the picture below.

Thank you.

Desired result

You referred to one of my earlier answers as a PDFBox example that did not work for you. Indeed, as already explained in that answer, it was a surprise to see that code match anything beyond single words, because the callers of the routine overridden there gave the impression of calling it word by word. Thus, a match spanning more than a single line could hardly be expected to be found.

But one can improve that example in quite a natural manner to allow searches across line breaks, assuming lines are split at spaces. Replace the method findSubwords with this improved version:

List<TextPositionSequence> findSubwordsImproved(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPosition> allTextPositions = new ArrayList<>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            allTextPositions.addAll(textPositions);
            super.writeString(text, textPositions);
        }

        @Override
        protected void writeLineSeparator() throws IOException {
            // PDFBox is about to emit a line break. Unless the line already ends with a
            // space, append a virtual space so a term split across lines still matches.
            if (!allTextPositions.isEmpty()) {
                TextPosition last = allTextPositions.get(allTextPositions.size() - 1);
                if (!" ".equals(last.getUnicode())) {
                    // Position the virtual space at the end of the last glyph of the line.
                    Matrix textMatrix = last.getTextMatrix().clone();
                    textMatrix.setValue(2, 0, last.getEndX());
                    textMatrix.setValue(2, 1, last.getEndY());
                    TextPosition separatorSpace = new TextPosition(last.getRotation(), last.getPageWidth(), last.getPageHeight(),
                            textMatrix, last.getEndX(), last.getEndY(), last.getHeight(), 0, last.getWidthOfSpace(), " ",
                            new int[] {' '}, last.getFont(), last.getFontSize(), (int) last.getFontSizeInPt());
                    allTextPositions.add(separatorSpace);
                }
            }
            super.writeLineSeparator();
        }
    };
    
    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);

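    final List<TextPositionSequence> hits = new ArrayList<>();
    // An empty search term would make the loop below run forever.
    if (searchTerm.isEmpty())
        return hits;

    // Treat all text positions of the page as one long character sequence.
    TextPositionSequence word = new TextPositionSequence(allTextPositions);
    String string = word.toString();

    // Collect every occurrence, including overlapping ones.
    int fromIndex = 0;
    int index;
    while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
    {
        hits.add(word.subSequence(index, index + searchTerm.length()));
        fromIndex = index + 1;
    }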

    return hits;
}

(SearchSubword method)

Here we collect all TextPosition entries, and we even add virtual entries representing a space whenever PDFBox adds a line break. Once the whole page has been processed, we search the collection of all these text positions.
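The idea can be tried without PDFBox. The following is a minimal sketch under stated assumptions, not the code from the answer: Glyph is a hypothetical stand-in for TextPosition, line(...) fabricates evenly spaced glyphs for the demo, and every glyph is assumed to carry exactly one character so that string indexes map directly to glyph indexes.

```java
import java.util.ArrayList;
import java.util.List;

class LineBreakSearchSketch {
    /** Hypothetical stand-in for a PDFBox TextPosition: one character and its origin. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates the lines, adding a virtual space after each line that does not end in one. */
    static List<Glyph> flatten(List<List<Glyph>> lines) {
        List<Glyph> all = new ArrayList<>();
        for (List<Glyph> line : lines) {
            all.addAll(line);
            if (!all.isEmpty()) {
                Glyph last = all.get(all.size() - 1);
                if (!" ".equals(last.text)) {
                    all.add(new Glyph(" ", last.x, last.y));
                }
            }
        }
        return all;
    }

    /** Returns the glyphs covered by each occurrence of term (overlaps included). */
    static List<List<Glyph>> find(List<Glyph> all, String term) {
        List<List<Glyph>> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits; // an empty term would otherwise loop forever
        }
        StringBuilder text = new StringBuilder();
        for (Glyph g : all) {
            text.append(g.text);
        }
        int from = 0;
        int index;
        while ((index = text.indexOf(term, from)) > -1) {
            hits.add(all.subList(index, index + term.length()));
            from = index + 1;
        }
        return hits;
    }

    /** Demo helper (an assumption): one glyph per character, 6 units apart. */
    static List<Glyph> line(String s, float startX, float y) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(s.charAt(i)), startX + i * 6f, y));
        }
        return glyphs;
    }

    public static void main(String[] args) {
        List<List<Glyph>> lines = new ArrayList<>();
        lines.add(line("foo ${var", 100f, 10f));
        lines.add(line("2} bar", 100f, 20f));
        for (List<Glyph> hit : find(flatten(lines), "${var 2}")) {
            Glyph first = hit.get(0);
            Glyph last = hit.get(hit.size() - 1);
            System.out.println("starts at " + first.x + ", " + first.y
                    + "; last letter at " + last.x + ", " + last.y);
        }
    }
}
```

Without the virtual space, the concatenated text would read "${var2}" at the line break and the search term would not be found.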

Applied to the example document in the original question, looking for "${var 2}" now returns all 8 occurrences, including those split across lines:

* Looking for '${var 2}' (improved)
  Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
  Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
  Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
  Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
  Page 1 at 164.39648, 357.28998 with width -46.081444 and last letter '}' at 112.46, 372.65
  Page 1 at 174.97762, 388.72998 with width -56.662575 and last letter '}' at 112.46, 404.09
  Page 1 at 153.74, 420.16998 with width -32.004005 and last letter '}' at 112.46, 435.65
  Page 1 at 162.99922, 451.61 with width -43.692017 and last letter '}' at 112.46, 467.21

The negative widths occur for the matches that wrap onto the next line: there the x coordinate of the end of the match is less than that of its start.
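If one rectangle per line is needed, for example to highlight a hit, a wrapped match can be split wherever the baseline changes. This is only a sketch under assumptions: Glyph and Segment are hypothetical types, not PDFBox classes, and glyphs on the same line are assumed to share exactly the same y value (real documents may need a tolerance).

```java
import java.util.ArrayList;
import java.util.List;

class HitBoxSketch {
    /** Hypothetical glyph: left edge x, right edge endX, baseline y. */
    static final class Glyph {
        final float x;
        final float endX;
        final float y;

        Glyph(float x, float endX, float y) {
            this.x = x;
            this.endX = endX;
            this.y = y;
        }
    }

    /** The part of a hit that lies on a single line. */
    static final class Segment {
        final float x;
        final float y;
        final float width;

        Segment(float x, float y, float width) {
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Width as reported above: end of the last glyph minus start of the first. */
    static float naiveWidth(List<Glyph> hit) {
        return hit.get(hit.size() - 1).endX - hit.get(0).x;
    }

    /** Splits a hit into one segment per line, starting a new one whenever y changes. */
    static List<Segment> splitByLine(List<Glyph> hit) {
        List<Segment> segments = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= hit.size(); i++) {
            if (i == hit.size() || hit.get(i).y != hit.get(start).y) {
                Glyph first = hit.get(start);
                Glyph last = hit.get(i - 1);
                segments.add(new Segment(first.x, first.y, last.endX - first.x));
                start = i;
            }
        }
        return segments;
    }
}
```

For a hit that stays on one line, splitByLine returns a single segment whose width equals naiveWidth.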
