pdfclown：如何覆蓋pdfclown中現有的突出顯示的關鍵字

Question

我在pdfclown中得到了要求，例如是否有很少的子字符串/與另一個關鍵詞匹配的關鍵詞，而突出顯示這些關鍵詞必須被覆蓋並且應該允許突出顯示完整關鍵詞。例如在下面的地圖中ETS關鍵詞是just.ETS的子字符串和Test.ETS關鍵字。 預期結果應類似於我們需要突出顯示完整關鍵字，例如just.ETS，Test.ETS，而不是ETS關鍵字及其彈出度量值。 。 ActualPdf和實際結果pdf 。 和罐子路徑。

Map<String, String> m = new HashMap<String, String>();
        map.put("ETS" , "Loss");
        map.put("Just. ETS" , "Net ");
        map.put("Test. ETS" , "Profit");

（注意：1。如果文件中已經突出了大號關鍵字，則與大號關鍵字匹配的小號關鍵字不允許突出顯示2。並忽略/取消顯示小關鍵字。）。

    import java.awt.Color;
    import java.awt.Desktop;
    import java.awt.geom.Rectangle2D;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UnsupportedEncodingException;
    import java.net.URL;
    import java.nio.charset.Charset;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Date;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.io.File;
    import org.pdfclown.documents.Page;
    import org.pdfclown.documents.contents.ITextString;
    import org.pdfclown.documents.contents.TextChar;
    import org.pdfclown.documents.contents.colorSpaces.DeviceRGBColor;
    import org.pdfclown.documents.interaction.annotations.TextMarkup;
    import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;

    import org.pdfclown.files.SerializationModeEnum;
    import org.pdfclown.util.math.Interval;
    import org.pdfclown.util.math.geom.Quad;
    import org.pdfclown.tools.TextExtractor;

    public class pdfclown2 {
        private static int count;

        public static void main(String[] args) throws IOException {

            highlight("C:\\Users\\uc23\\Desktop\\pdf\\80743064.pdf","C:\\Users\\\Downloads\\6.pdf");
            System.out.println("OK");
        }
        private static void highlight(String inputPath, String outputPath) throws IOException {




   org.pdfclown.files.File file = null;

try {
    file = new org.pdfclown.files.File("C:\\Users\\uc239646\\Desktop\\test.pdf");

List<Keyword> l=new ArrayList<Keyword>();
Keyword k=new Keyword();
Keyword k1=new Keyword();
k1.setKey("Just. ETS");
k1.setValue("NET");
l.add(k1);
Keyword k2=new Keyword();
k2.setKey("Test. ETS");
k2.setValue("PROFIT");
l.add(k2);
k.setKey("ETS");
k.setValue("LOSS");
l.add(k);

 long startTime = System.currentTimeMillis();




    // 2. Iterating through the document pages...
    TextExtractor textExtractor = new TextExtractor(true, true);
    for (final Page page : file.getDocument().getPages()) {
        Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
        for (Keyword e : l) {
            Pattern pattern;
            String serachKey =  e.getKey();
            final String translationKeyword = e.getValue();

                if ((serachKey.contains(")") && serachKey.contains("("))
                        || (serachKey.contains("(") && !serachKey.contains(")"))
                        || (serachKey.contains(")") && !serachKey.contains("(")) || serachKey.contains("?")
                        || serachKey.contains("*") || serachKey.contains("+")) {
                    pattern = Pattern.compile(Pattern.quote(serachKey), Pattern.CASE_INSENSITIVE);
                }
                else
                     pattern = Pattern.compile("\\b"+serachKey+"\\b", Pattern.CASE_INSENSITIVE);
        // 2.1. Extract the page text!

    //System.out.println(textStrings.toString().indexOf(entry.getKey()));

        // 2.2. Find the text pattern matches!
                        final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());
        // 2.3. Highlight the text pattern matches!
        //System.out.println(textStrings);
        textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {

            public boolean hasNext() {
                // if(key.getMatchCriteria() == 1){
                if (matcher.find()) {
                    return true;
                }
                /*
                 * } else if(key.getMatchCriteria() == 2) { if
                 * 
                 * 
                 * 
                 * 
                 * 
                 * 
                 * 
                 * 
                 * (matcher.hitEnd()) { count++; return true; } }
                 */
                return false;

            }

            public Interval<Integer> next() {
                return new Interval<Integer>(matcher.start(), matcher.end());
            }

            public void process(Interval<Integer> interval, ITextString match) {
                System.out.println(match);
                // Defining the highlight box of the text pattern
                // match...
                /*List l=new ArrayList();
                if(!l.contains(match)){
                    System.out.println("map.put("+match+","+translationKeyword+")");
                }
            */
                List<Quad> highlightQuads = new ArrayList<Quad>();
                {
                    Rectangle2D textBox = null;
                    for (TextChar textChar : match.getTextChars()) {
                        Rectangle2D textCharBox = textChar.getBox();
                        if (textBox == null) {
                            textBox = (Rectangle2D) textCharBox.clone();
                        } else {
                            if (textCharBox.getY() > textBox.getMaxY()) {
                                highlightQuads.add(Quad.get(textBox));
                                textBox = (Rectangle2D) textCharBox.clone();
                            } else {
                                textBox.add(textCharBox);
                            }
                        }

                    System.out.println(highlightQuads.contains(textBox));

                    textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
                    highlightQuads.add(Quad.get(textBox));
                }
            /*  List<Quad> highlightQuads = new ArrayList<Quad>();
                List<TextChar> textChars = match.getTextChars();
                Rectangle2D firstRect = textChars.get(0).getBox();
                Rectangle2D lastRect = textChars.get(textChars.size()-1).getBox();
                Rectangle2D rect = firstRect.createUnion(lastRect);
                highlightQuads.add(Quad.get(rect));*/
                // subtype can be Highlight, Underline, StrikeOut, Squiggly


                new TextMarkup(page, highlightQuads, translationKeyword, MarkupTypeEnum.Highlight);

            }

            }

            public void remove() {
                throw new UnsupportedOperationException();
            }

        });

    }

}

    SerializationModeEnum serializationMode = SerializationModeEnum.Standard;
    file.save(new java.io.File(outputPath), serializationMode);
    System.out.println("file created");
    long endTime = System.currentTimeMillis();
    System.out.println("seconds take for execution is:"+(endTime-startTime)/1000);

} catch (Exception e) {
       e.printStackTrace();
}


        }
    }

Answer 1

正如評論中已經提到的（同時已經轉移到chat ）：

您的問題僅成為PDF小丑問題，因為您嘗試將購物車放在馬匹前面：

您已確定要創建太多突出顯示。

顯而易見的解決方案是從一開始就停止制作那些多余的突出顯示，並將其整理出來與PDF Clown無關。

另一方面，您嘗試的解決方案是在事后刪除多余的高光，這僅對您來說是一個PDF小丑問題，因為現在您必須搜索已經存在的高光中是否存在重疊。 該解決方案也是可行的，但是不必要地浪費了資源。

這里是一種在創建亮點之前對不需要的匹配進行分類的方法。 頁面循環內容將替換為：

[...]
TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages()) {
    Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);

    List<Match> matches = new ArrayList<>();

    for (Keyword e : l) {
        final String searchKey = e.getKey();
        final String translationKeyword = e.getValue();

        final Pattern pattern;
        if ((searchKey.contains(")") && searchKey.contains("("))
                || (searchKey.contains("(") && !searchKey.contains(")"))
                || (searchKey.contains(")") && !searchKey.contains("(")) || searchKey.contains("?")
                || searchKey.contains("*") || searchKey.contains("+")) {
            pattern = Pattern.compile(Pattern.quote(searchKey), Pattern.CASE_INSENSITIVE);
        } else
            pattern = Pattern.compile("\\b" + searchKey + "\\b", Pattern.CASE_INSENSITIVE);

        final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());

        textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {
            public boolean hasNext() {
                return matcher.find();
            }

            public Interval<Integer> next() {
                return new Interval<Integer>(matcher.start(), matcher.end(), true, false);
            }

            public void process(Interval<Integer> interval, ITextString match) {
                matches.add(new Match(interval, match, translationKeyword));
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        });
    }

    removeOverlaps(matches);

    for (Match match : matches) {
        List<Quad> highlightQuads = new ArrayList<Quad>();
        {
            Rectangle2D textBox = null;
            for (TextChar textChar : match.match.getTextChars()) {
                Rectangle2D textCharBox = textChar.getBox();
                if (textBox == null) {
                    textBox = (Rectangle2D) textCharBox.clone();
                } else {
                    if (textCharBox.getY() > textBox.getMaxY()) {
                        highlightQuads.add(Quad.get(textBox));
                        textBox = (Rectangle2D) textCharBox.clone();
                    } else {
                        textBox.add(textCharBox);
                    }
                }

                textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(),
                        textBox.getHeight());
                highlightQuads.add(Quad.get(textBox));
            }

            new TextMarkup(page, highlightQuads, match.tag, MarkupTypeEnum.Highlight);
        }
    }
}
[...]

（ ComplexHighlight測試testMarkLikeSeshadriImproved ）

利用這些輔助方法/類：

static void removeOverlaps(List<Match> matches) {
    Collections.sort(matches, ComplexHighlight::compareLowLengthTag);

    for (int i = 0; i < matches.size() - 1; i++) {
        Interval<Integer> intervalI = matches.get(i).interval;
        for (int j = i + 1; j < matches.size(); j++) {
            Interval<Integer> intervalJ = matches.get(j).interval;
            if (intervalI.getLow() < intervalJ.getHigh() && intervalJ.getLow() < intervalI.getHigh()) {
                System.out.printf("Match %d removed as it overlaps match %d.\n", j, i);
                matches.remove(j--);
            }
        }
    }
}

（ ComplexHighlight方法removeOverlaps ）

static int compareLowLengthTag(Match a, Match b) {
    int compare = a.interval.getLow().compareTo(b.interval.getLow());
    if (compare == 0)
        compare = - a.interval.getHigh().compareTo(b.interval.getHigh());
    if (compare == 0)
        compare = a.tag.compareTo(b.tag);
    return compare;
}

（ ComplexHighlight方法compareLowLengthTag ）

class Match {
    final Interval<Integer> interval;
    final ITextString match;
    final String tag;

    public Match(final Interval<Integer> interval, final ITextString match, final String tag) {
        this.interval = interval;
        this.match = match;
        this.tag = tag;
    }
}

（比賽類）

如您所見，此處的匹配項不會立即添加為突出顯示，而是收集在列表matches 。 然后將該列表處理為不再包含重疊，並且僅將其余列表中不重疊的元素添加為突出顯示。

正如評論中所提到的，人們必須決定比賽的優先級。

例如，在搜索詞“ AB”和“ BCD”以及文檔文本“ ABCD”的情況下，上面使用的比較方法compareLowLengthTag始終首選AB匹配，而以下比較方法compareLengthLowTag首選較長的匹配BCD，並且只有在長度相等的情況下傾向於早些時候開始比賽：

static int compareLengthLowTag(Match a, Match b) {
    int aLength = a.interval.getHigh() - a.interval.getLow();
    int bLength = b.interval.getHigh() - b.interval.getLow();
    int compare = - Integer.compare(aLength, bLength);
    if (compare == 0)
        compare = a.interval.getLow().compareTo(b.interval.getLow());
    if (compare == 0)
        compare = a.tag.compareTo(b.tag);
    return compare;
}

（ ComplexHighlight方法compareLengthLowTag ）

pdfclown：如何覆蓋pdfclown中現有的突出顯示的關鍵字

問題描述

1 個解決方案

解決方案1
0 已采納 2018-04-19 14:58:45

pdfclown：如何覆蓋pdfclown中現有的突出顯示的關鍵字

問題描述

1 個解決方案

解決方案1 0 已采納 2018-04-19 14:58:45

解決方案1
0 已采納 2018-04-19 14:58:45