How to improve the performance of highlighting search keywords in a file using pdfclown

Question

I am using pdfclown, and the code below takes around 100 seconds to highlight the search keywords in a single file. Please suggest how to improve its performance. The jars needed to run the code are available at the following URL: https://drive.google.com/drive/folders/1nW8bk6bcAG6g7LZYy2YAAMk46hI9IPUh

import java.awt.Color;
import java.awt.Desktop;
import java.awt.geom.Rectangle2D;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.File;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.contents.colorSpaces.DeviceRGBColor;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;

import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;
import org.pdfclown.tools.TextExtractor;

public class pdfclown2 {
    private static int count;

    public static void main(String[] args) throws IOException {

        highlight("book.pdf","C:\\Users\\Downloads\\6.pdf");
        System.out.println("OK");
    }
    private static void highlight(String inputPath, String outputPath) throws IOException {

        URL url = new URL(inputPath);
        InputStream in = url.openStream();
        org.pdfclown.files.File file = null;
        //"C:\\Users\\Desktop\\pdf\\80743064.pdf"
        try {
            file = new org.pdfclown.files.File("C:\\Users\\uc23\\Desktop\\pdf\\80743064.pdf");

        // Build a test map of about 3500 keywords and their annotation texts.
        Map<String, String> m = new HashMap<String, String>();
        for (int i = 0; i < 3500; i++) {
            if (i <= 2) {
                m.put("The", "hi");
                m.put("know", "hello");
                m.put("is", "Welcome");
            } else {
                m.put("" + i, "hi");
            }
        }

        System.out.println("map size" + m.size());
        long startTime = System.currentTimeMillis();

        for (Map.Entry<String, String> entry : m.entrySet()) {

            Pattern pattern;
            String serachKey =  entry.getKey().toLowerCase();
            final String translationKeyword = entry.getValue();

                if ((serachKey.contains(")") && serachKey.contains("("))
                        || (serachKey.contains("(") && !serachKey.contains(")"))
                        || (serachKey.contains(")") && !serachKey.contains("(")) || serachKey.contains("?")
                        || serachKey.contains("*") || serachKey.contains("+")) {
                    pattern = Pattern.compile(Pattern.quote(serachKey), Pattern.CASE_INSENSITIVE);
                }
                else
                     pattern = Pattern.compile( "\\b"+serachKey+"\\b", Pattern.CASE_INSENSITIVE);


            // 2. Iterating through the document pages...
            TextExtractor textExtractor = new TextExtractor(true, true);
            for (final Page page : file.getDocument().getPages()) {
                // 2.1. Extract the page text!
                Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
            //System.out.println(textStrings.toString().indexOf(entry.getKey()));

                // 2.2. Find the text pattern matches!
                final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());
                // 2.3. Highlight the text pattern matches!
                textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {
                    public boolean hasNext() {
                        // System.out.println(matcher.find());
                        // if(key.getMatchCriteria() == 1){
                        if (matcher.find()) {
                            return true;
                        }
                        /*
                         * } else if(key.getMatchCriteria() == 2) { if
                         * (matcher.hitEnd()) { count++; return true; } }
                         */
                        return false;

                    }

                    public Interval<Integer> next() {
                        return new Interval<Integer>(matcher.start(), matcher.end());
                    }

                    public void process(Interval<Integer> interval, ITextString match) {
                        // Defining the highlight box of the text pattern
                        // match...
                        System.out.println(match);
                        List<Quad> highlightQuads = new ArrayList<Quad>();
                        {
                            Rectangle2D textBox = null;
                            for (TextChar textChar : match.getTextChars()) {
                                Rectangle2D textCharBox = textChar.getBox();
                                if (textBox == null) {
                                    textBox = (Rectangle2D) textCharBox.clone();
                                } else {
                                    if (textCharBox.getY() > textBox.getMaxY()) {
                                        highlightQuads.add(Quad.get(textBox));
                                        textBox = (Rectangle2D) textCharBox.clone();
                                    } else {
                                        textBox.add(textCharBox);
                                    }
                                }
                            }
                            textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
                            highlightQuads.add(Quad.get(textBox));
                        }

                        new TextMarkup(page, highlightQuads, translationKeyword, MarkupTypeEnum.Highlight);

                    }

                    public void remove() {
                        throw new UnsupportedOperationException();
                    }

                });
            }

        }

        SerializationModeEnum serializationMode = SerializationModeEnum.Incremental;

            file.save(new java.io.File(outputPath), serializationMode);

            System.out.println("file created");
            long endTime = System.currentTimeMillis();

             System.out.println("seconds take for execution is:"+(endTime-startTime)/1000);

        } catch (Exception e) {
               e.printStackTrace();
        }
        finally{
            in.close();
        }


    }
}

Answer 1

My guess is that process is the bottleneck, which is easy to test: comment that code out and measure the times. This is a good occasion to profile the application.
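Before reaching for a full profiler, one way to test such a guess is to time each phase separately. The following is a minimal, JDK-only sketch; the class name `PhaseTimer` and the phase labels are hypothetical, and the `sleep` calls merely stand in for the pdfclown extraction and highlighting steps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of pdfclown): sums the wall-clock time spent in each named phase.
class PhaseTimer {
    private final Map<String, Long> totalsNanos = new LinkedHashMap<>();

    // Runs the task and adds its elapsed time to the running total of the given phase.
    void time(String phase, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            totalsNanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Total milliseconds recorded for a phase, or 0 if the phase never ran.
    long totalMillis(String phase) {
        return totalsNanos.getOrDefault(phase, 0L) / 1_000_000;
    }

    Map<String, Long> totalsNanos() {
        return totalsNanos;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        for (int page = 0; page < 3; page++) {
            timer.time("extract", () -> sleep(20));  // stand-in for textExtractor.extract(page)
            timer.time("highlight", () -> sleep(5)); // stand-in for the filter/process callback
        }
        timer.totalsNanos().forEach((phase, nanos) ->
                System.out.println(phase + ": " + nanos / 1_000_000 + " ms"));
    }
}
```

Wrapping the real `extract` and `filter` calls in `timer.time(...)` should show which of the two dominates the 100 seconds.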

A simple heuristic optimisation: for matches on a single line, take the rectangles of the first and last TextChar and, allowing for font ascenders and descenders, build one enclosing rectangle from them. That alone should already speed things up.

Alternatives probably exist. Ask a more specific question.

Further improvements:

    InputStream in = url.openStream();

should be

    InputStream in = new BufferedInputStream(url.openStream());

And the multiple contains checks on the search key could possibly be replaced by a single Pattern declared before the loop.
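As a sketch of that idea: the three parenthesis clauses in the question reduce to "the key contains `(` or `)`", so the whole condition amounts to "the key contains any of `( ) ? * +`". A single precompiled character class can test that. The class and method names here are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: builds the search pattern for one keyword.
class KeywordPatterns {
    // Compiled once; finds any of the characters the original contains() chain tests for.
    private static final Pattern SPECIAL_CHARS = Pattern.compile("[()?*+]");

    static Pattern forKeyword(String searchKey) {
        if (SPECIAL_CHARS.matcher(searchKey).find()) {
            // Keys with regex metacharacters are matched literally.
            return Pattern.compile(Pattern.quote(searchKey), Pattern.CASE_INSENSITIVE);
        }
        // Plain keys are matched as whole words.
        return Pattern.compile("\\b" + searchKey + "\\b", Pattern.CASE_INSENSITIVE);
    }
}
```

This mainly tidies the code; the check itself is unlikely to account for much of the running time.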

The same technique could be applied to the original highlighting code, but multi-line support would then have to be added: one Quad for every line.

The textExtractor is reused for every page, which seems the fastest way, but try declaring it inside the page loop.

I hope you get a more concrete answer, though I doubt it, hence this one. It would have been better to isolate the slow code from the rest. But I understand the wish for an overall performance gain.

A less precise, but possibly faster, highlight code:

                    List<TextChar> textChars = match.getTextChars();
                    Rectangle2D firstRect = textChars.get(0).getBox();
                    Rectangle2D lastRect = textChars.get(textChars.size() - 1).getBox();
                    Rectangle2D rect = firstRect.createUnion(lastRect);
                    highlightQuads.add(Quad.get(rect));
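The first/last union above is only right when the match sits on one line. For matches that wrap, a possible extension keeps the same shortcut per line: remember the first and last box of each line and union only those two. The sketch below works on plain `java.awt.geom.Rectangle2D` boxes, so it runs without pdfclown; `LineBoxes` is a hypothetical name, and the line-break test (a box starting below the bottom of the previous one) is borrowed from the question's code.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one bounding rectangle per text line of a match.
class LineBoxes {
    static List<Rectangle2D> perLine(List<Rectangle2D> charBoxes) {
        List<Rectangle2D> lines = new ArrayList<>();
        Rectangle2D first = null; // first character box of the current line
        Rectangle2D last = null;  // most recent character box of the current line
        for (Rectangle2D box : charBoxes) {
            if (first == null) {
                first = box;
            } else if (box.getY() > last.getMaxY()) {
                // The box starts below the previous one: close the current line.
                lines.add(first.createUnion(last));
                first = box;
            }
            last = box;
        }
        if (first != null) {
            lines.add(first.createUnion(last));
        }
        return lines;
    }
}
```

Each returned rectangle would then be passed to `Quad.get(...)` as in the snippet above. An empty match yields an empty list instead of a NullPointerException.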

After another comment

It seems that the bottleneck lies elsewhere. My guess is then the text extraction, so invert the two loops:

TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages()) {
    // Extract the page text once per page instead of once per keyword.
    Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);

    for (Map.Entry<String, String> entry : m.entrySet()) {
        Pattern pattern;
        String serachKey =  entry.getKey().toLowerCase();
        final String translationKeyword = entry.getValue();

        if ((serachKey.contains(")") && serachKey.contains("("))
                    || (serachKey.contains("(") && !serachKey.contains(")"))
                    || (serachKey.contains(")") && !serachKey.contains("(")) || serachKey.contains("?")
                    || serachKey.contains("*") || serachKey.contains("+")) {
                pattern = Pattern.compile(Pattern.quote(serachKey), Pattern.CASE_INSENSITIVE);
        }
        else
             pattern = Pattern.compile( "\\b"+serachKey+"\\b", Pattern.CASE_INSENSITIVE);

        // ... match and highlight on textStrings as in the original code ...
    }
}
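To illustrate why the loop order matters, here is a small JDK-only sketch. With pages on the outside, the expensive extraction runs once per page rather than once per keyword per page (about 3500 times fewer calls with the question's keyword map). The `extractText` parameter is a hypothetical stand-in for `textExtractor.extract(page)`, and `InvertedLoops` is not part of pdfclown.

```java
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: page loop outside, keyword loop inside.
class InvertedLoops {
    // Counts all matches of all patterns on all pages, extracting each page's text exactly once.
    static <P> int countMatches(List<P> pages, List<Pattern> patterns, Function<P, String> extractText) {
        int matches = 0;
        for (P page : pages) {
            String text = extractText.apply(page); // expensive step: once per page
            for (Pattern pattern : patterns) {
                Matcher matcher = pattern.matcher(text); // cheap step: once per keyword
                while (matcher.find()) {
                    matches++;
                }
            }
        }
        return matches;
    }
}
```

In the real code the inner loop would call `textExtractor.filter(...)` on the already extracted text instead of just counting.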

It probably makes sense to keep a map of Pattern objects, as Pattern.compile is slow.
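A minimal sketch of such a map, assuming the keywords do not change while the document is processed. `PatternCache` is a hypothetical name, and the compile rule is passed in so that the question's quoting logic can be reused unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical helper: compiles each keyword's Pattern at most once.
class PatternCache {
    private final Map<String, Pattern> cache = new HashMap<>();
    private final Function<String, Pattern> compiler;

    PatternCache(Function<String, Pattern> compiler) {
        this.compiler = compiler;
    }

    // Returns the cached Pattern, compiling it on first use only.
    Pattern get(String keyword) {
        return cache.computeIfAbsent(keyword, compiler);
    }

    public static void main(String[] args) {
        PatternCache patterns = new PatternCache(
                key -> Pattern.compile("\\b" + key + "\\b", Pattern.CASE_INSENSITIVE));
        // Both lookups return the same compiled instance.
        System.out.println(patterns.get("know") == patterns.get("know"));
    }
}
```

With the page loop on the outside, each keyword is then compiled once in total instead of once per page.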

And then I am out of ideas / have other things to do.然后我没有想法/还有其他事情要做。

Solution 1
1 Accepted 2018-02-21 08:32:25
