如何使用 pdfclown 提高文件上突出顯示的搜索關鍵字的性能

Question

我正在使用 pdfclown 和以下代碼需要大約 100 秒來突出顯示同一文件中的搜索關鍵字。請提供您的輸入以提高以下代碼中的性能。請在以下 url 中找到 jar 路徑以運行此代碼。 https://drive.google.com/drive/folders/1nW8bk6bcAG6g7LZYy2YAAMk46hI9IPUh

import java.awt.Color;
import java.awt.Desktop;
import java.awt.geom.Rectangle2D;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.File;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.contents.colorSpaces.DeviceRGBColor;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;

import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;
import org.pdfclown.tools.TextExtractor;

public class pdfclown2 {
    private static int count;

    public static void main(String[] args) throws IOException {

        highlight("book.pdf","C:\\Users\\\Downloads\\6.pdf");
        System.out.println("OK");
    }
    private static void highlight(String inputPath, String outputPath) throws IOException {

        URL url = new URL(inputPath);
        InputStream in = url.openStream();
        org.pdfclown.files.File file = null;
        //"C:\\Users\\Desktop\\pdf\\80743064.pdf"
        try {
            file = new org.pdfclown.files.File("C:\\Users\\uc23\\Desktop\\pdf\\80743064.pdf);

        Map<String, String> m = new HashMap<String, String>();
    for(int i=0;i<3500;i++){

        if(i<=2){
        m.put("The","hi");
        m.put("know","hello");
        m.put("is","Welcome");
        }else{
            m.put(""+i,"hi");
        }
    }

        System.out.println("map size"+m.size());
         long startTime = System.currentTimeMillis();

        for (Map.Entry<String, String> entry : m.entrySet()) {

            Pattern pattern;
            String serachKey =  entry.getKey().toLowerCase();
            final String translationKeyword = entry.getValue();

                if ((serachKey.contains(")") && serachKey.contains("("))
                        || (serachKey.contains("(") && !serachKey.contains(")"))
                        || (serachKey.contains(")") && !serachKey.contains("(")) || serachKey.contains("?")
                        || serachKey.contains("*") || serachKey.contains("+")) {
                    pattern = Pattern.compile(Pattern.quote(serachKey), Pattern.CASE_INSENSITIVE);
                }
                else
                     pattern = Pattern.compile( "\\b"+serachKey+"\\b", Pattern.CASE_INSENSITIVE);


            // 2. Iterating through the document pages...
            TextExtractor textExtractor = new TextExtractor(true, true);
            for (final Page page : file.getDocument().getPages()) {
                // 2.1. Extract the page text!
                Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
            //System.out.println(textStrings.toString().indexOf(entry.getKey()));

                // 2.2. Find the text pattern matches!
                final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());
                // 2.3. Highlight the text pattern matches!
                textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {
                    public boolean hasNext() {
                        // System.out.println(matcher.find());
                        // if(key.getMatchCriteria() == 1){
                        if (matcher.find()) {
                            return true;
                        }
                        /*
                         * } else if(key.getMatchCriteria() == 2) { if
                         * (matcher.hitEnd()) { count++; return true; } }
                         */
                        return false;

                    }

                    public Interval<Integer> next() {
                        return new Interval<Integer>(matcher.start(), matcher.end());
                    }

                    public void process(Interval<Integer> interval, ITextString match) {
                        // Defining the highlight box of the text pattern
                        // match...
                        System.out.println(match);
                        List<Quad> highlightQuads = new ArrayList<Quad>();
                        {
                            Rectangle2D textBox = null;
                            for (TextChar textChar : match.getTextChars()) {
                                Rectangle2D textCharBox = textChar.getBox();
                                if (textBox == null) {
                                    textBox = (Rectangle2D) textCharBox.clone();
                                } else {
                                    if (textCharBox.getY() > textBox.getMaxY()) {
                                        highlightQuads.add(Quad.get(textBox));
                                        textBox = (Rectangle2D) textCharBox.clone();
                                    } else {
                                        textBox.add(textCharBox);
                                    }
                                }
                            }
                            textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
                            highlightQuads.add(Quad.get(textBox));
                        }

                        new TextMarkup(page, highlightQuads, translationKeyword, MarkupTypeEnum.Highlight);

                    }

                    public void remove() {
                        throw new UnsupportedOperationException();
                    }

                });
            }

        }

        SerializationModeEnum serializationMode = SerializationModeEnum.Incremental;

            file.save(new java.io.File(outputPath), serializationMode);

            System.out.println("file created");
            long endTime = System.currentTimeMillis();

             System.out.println("seconds take for execution is:"+(endTime-startTime)/1000);

        } catch (Exception e) {
               e.printStackTrace();
        }
        finally{
            in.close();
        }


    }
}

Answer 1

我的猜測是這個process是瓶頸，它可以很容易地測試（注釋掉代碼）。 測量時間。 分析應用程序的好時機。

一個簡單的啟發式優化：將第一個和最后一個 TextChar 矩形作為一個行，並考慮字體上升和下降，可以創建一個完整的矩形。 這已經可以加快速度了。

可能存在替代方案。 提出一個更具體的問題。

進一步改進：

    InputStream in = url.openStream();

應該

    InputStream in = new BufferedInputStream(url.openStream());

並且乘法 searchKey.contains 可能是在循環之前聲明的模式。

可以對原始突出顯示代碼執行相同的技術，但隨后應添加多行支持，每行一個 Quad。

textExtractor 可用於每個頁面，這似乎是最快的方式，但請嘗試在頁面循環中聲明它。

我希望你得到一個更具體的答案，盡管我對此表示懷疑，因此是這個。 最好將慢代碼與整體隔離開來。 但我理解整體性能提升的願望。

一個不太精確但可能更快的高亮代碼：

                    List<TextChar> textChars = match.getTextChars();
                    Rectangle2D firstRect = textChars.get(0).getBox();
                    Rectangle2D lastRect = textChars.get(textChars.size() - 1).getBox();
                    Rectangle2D rect = firstRect.createUnion(lastRect);
                    highlightQuads.add(Quad.get(rect));

在其他評論之后

瓶頸似乎在別處。 我的猜測是文本提取然后：所以反轉兩個循環：

TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages()) {

    for (Map.Entry<String, String> entry : m.entrySet()) {
        Pattern pattern;
        String serachKey =  entry.getKey().toLowerCase();
        final String translationKeyword = entry.getValue();

        if ((serachKey.contains(")") && serachKey.contains("("))
                    || (serachKey.contains("(") && !serachKey.contains(")"))
                    || (serachKey.contains(")") && !serachKey.contains("(")) || serachKey.contains("?")
                    || serachKey.contains("*") || serachKey.contains("+")) {
                pattern = Pattern.compile(Pattern.quote(serachKey), Pattern.CASE_INSENSITIVE);
        }
        else
             pattern = Pattern.compile( "\\b"+serachKey+"\\b", Pattern.CASE_INSENSITIVE);

擁有Pattern的映射可能是有意義的，因為Pattern.compile很慢。

然后我沒有想法/還有其他事情要做。

如何使用 pdfclown 提高文件上突出顯示的搜索關鍵字的性能

問題描述

1 個解決方案

解決方案1
1 已采納 2018-02-21 08:32:25

如何使用 pdfclown 提高文件上突出顯示的搜索關鍵字的性能

問題描述

1 個解決方案

解決方案1 1 已采納 2018-02-21 08:32:25

解決方案1
1 已采納 2018-02-21 08:32:25