
Pdfclown: How to override the existing highlighted keyword in pdfclown

I have a requirement in pdfclown: if some keywords are substrings of (or otherwise match part of) another keyword, the highlight for the shorter keyword has to be overridden so that the full keyword is highlighted instead. For example, in the map below the keyword ETS is a substring of the keywords Just. ETS and Test. ETS. The expected result is that the full keywords Just. ETS and Test. ETS are highlighted, each with its own popup value, instead of the ETS keyword. Attached: the actual PDF, the actual result PDF, and the jar path.

Map<String, String> map = new HashMap<String, String>();
map.put("ETS", "Loss");
map.put("Just. ETS", "Net ");
map.put("Test. ETS", "Profit");

(Note: 1. If a longer keyword is already highlighted in the file, a shorter keyword that matches part of it must not be highlighted. 2. If a shorter keyword is already highlighted and it matches part of a longer keyword, the longer keyword should be highlighted and the shorter one ignored/unhighlighted.)

    import java.awt.Color;
    import java.awt.Desktop;
    import java.awt.geom.Rectangle2D;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UnsupportedEncodingException;
    import java.net.URL;
    import java.nio.charset.Charset;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Date;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.io.File;
    import org.pdfclown.documents.Page;
    import org.pdfclown.documents.contents.ITextString;
    import org.pdfclown.documents.contents.TextChar;
    import org.pdfclown.documents.contents.colorSpaces.DeviceRGBColor;
    import org.pdfclown.documents.interaction.annotations.TextMarkup;
    import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;

    import org.pdfclown.files.SerializationModeEnum;
    import org.pdfclown.util.math.Interval;
    import org.pdfclown.util.math.geom.Quad;
    import org.pdfclown.tools.TextExtractor;

    public class pdfclown2 {
        private static int count;

        public static void main(String[] args) throws IOException {

            highlight("C:\\Users\\uc23\\Desktop\\pdf\\80743064.pdf","C:\\Users\\Downloads\\6.pdf");
            System.out.println("OK");
        }
        private static void highlight(String inputPath, String outputPath) throws IOException {




   org.pdfclown.files.File file = null;

try {
    file = new org.pdfclown.files.File(inputPath);

List<Keyword> l=new ArrayList<Keyword>();
Keyword k=new Keyword();
Keyword k1=new Keyword();
k1.setKey("Just. ETS");
k1.setValue("NET");
l.add(k1);
Keyword k2=new Keyword();
k2.setKey("Test. ETS");
k2.setValue("PROFIT");
l.add(k2);
k.setKey("ETS");
k.setValue("LOSS");
l.add(k);

 long startTime = System.currentTimeMillis();




    // 2. Iterating through the document pages...
    TextExtractor textExtractor = new TextExtractor(true, true);
    for (final Page page : file.getDocument().getPages()) {
        Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
        for (Keyword e : l) {
            final Pattern pattern;
            String searchKey = e.getKey();
            final String translationKeyword = e.getValue();

            // Keys containing regex metacharacters are matched literally;
            // all other keys are matched as whole words.
            if ((searchKey.contains(")") && searchKey.contains("("))
                    || (searchKey.contains("(") && !searchKey.contains(")"))
                    || (searchKey.contains(")") && !searchKey.contains("(")) || searchKey.contains("?")
                    || searchKey.contains("*") || searchKey.contains("+")) {
                pattern = Pattern.compile(Pattern.quote(searchKey), Pattern.CASE_INSENSITIVE);
            } else {
                pattern = Pattern.compile("\\b" + searchKey + "\\b", Pattern.CASE_INSENSITIVE);
            }

        // 2.2. Find the text pattern matches!
        final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());
        // 2.3. Highlight the text pattern matches!
        textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {

            public boolean hasNext() {
                return matcher.find();
            }

            public Interval<Integer> next() {
                return new Interval<Integer>(matcher.start(), matcher.end());
            }

            public void process(Interval<Integer> interval, ITextString match) {
                System.out.println(match);
                // Defining the highlight box of the text pattern
                // match...
                /*List l=new ArrayList();
                if(!l.contains(match)){
                    System.out.println("map.put("+match+","+translationKeyword+")");
                }
            */
                List<Quad> highlightQuads = new ArrayList<Quad>();
                {
                    Rectangle2D textBox = null;
                    for (TextChar textChar : match.getTextChars()) {
                        Rectangle2D textCharBox = textChar.getBox();
                        if (textBox == null) {
                            textBox = (Rectangle2D) textCharBox.clone();
                        } else {
                            if (textCharBox.getY() > textBox.getMaxY()) {
                                highlightQuads.add(Quad.get(textBox));
                                textBox = (Rectangle2D) textCharBox.clone();
                            } else {
                                textBox.add(textCharBox);
                            }
                        }

                    System.out.println(highlightQuads.contains(textBox));

                    textBox.setRect(textBox.getX(), textBox.getY(), textBox.getWidth(), textBox.getHeight());
                    highlightQuads.add(Quad.get(textBox));
                }
            /*  List<Quad> highlightQuads = new ArrayList<Quad>();
                List<TextChar> textChars = match.getTextChars();
                Rectangle2D firstRect = textChars.get(0).getBox();
                Rectangle2D lastRect = textChars.get(textChars.size()-1).getBox();
                Rectangle2D rect = firstRect.createUnion(lastRect);
                highlightQuads.add(Quad.get(rect));*/
                // subtype can be Highlight, Underline, StrikeOut, Squiggly


                new TextMarkup(page, highlightQuads, translationKeyword, MarkupTypeEnum.Highlight);

            }

            }

            public void remove() {
                throw new UnsupportedOperationException();
            }

        });

    }

}

    SerializationModeEnum serializationMode = SerializationModeEnum.Standard;
    file.save(new java.io.File(outputPath), serializationMode);
    System.out.println("file created");
    long endTime = System.currentTimeMillis();
    System.out.println("seconds take for execution is:"+(endTime-startTime)/1000);

} catch (Exception e) {
       e.printStackTrace();
}


        }
    }

As already mentioned in comments (which have meanwhile been moved to chat):

Your issue only becomes a PDF Clown issue because you try to put the cart before the horse:

You have determined that you are creating too many highlights.

The obvious solution would be to stop making those surplus highlights from the start, and sorting that out is an issue unrelated to PDF Clown.

Your attempted solution, on the other hand, is to remove the surplus highlights after the fact, and only this makes it a PDF Clown issue for you, because now you have to search the already existing highlights for overlaps. That solution is a possible one, too, but it unnecessarily wastes resources.

Here is an approach that sorts out unwanted matches before highlights are created for them. The contents of your loop over the pages are replaced like this:

[...]
TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages()) {
    Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);

    List<Match> matches = new ArrayList<>();

    for (Keyword e : l) {
        final String searchKey = e.getKey();
        final String translationKeyword = e.getValue();

        final Pattern pattern;
        if ((searchKey.contains(")") && searchKey.contains("("))
                || (searchKey.contains("(") && !searchKey.contains(")"))
                || (searchKey.contains(")") && !searchKey.contains("(")) || searchKey.contains("?")
                || searchKey.contains("*") || searchKey.contains("+")) {
            pattern = Pattern.compile(Pattern.quote(searchKey), Pattern.CASE_INSENSITIVE);
        } else
            pattern = Pattern.compile("\\b" + searchKey + "\\b", Pattern.CASE_INSENSITIVE);

        final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings).toLowerCase());

        textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter() {
            public boolean hasNext() {
                return matcher.find();
            }

            public Interval<Integer> next() {
                return new Interval<Integer>(matcher.start(), matcher.end(), true, false);
            }

            public void process(Interval<Integer> interval, ITextString match) {
                matches.add(new Match(interval, match, translationKeyword));
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        });
    }

    removeOverlaps(matches);

    for (Match match : matches) {
        List<Quad> highlightQuads = new ArrayList<Quad>();
        {
            Rectangle2D textBox = null;
            for (TextChar textChar : match.match.getTextChars()) {
                Rectangle2D textCharBox = textChar.getBox();
                if (textBox == null) {
                    textBox = (Rectangle2D) textCharBox.clone();
                } else {
                    if (textCharBox.getY() > textBox.getMaxY()) {
                        highlightQuads.add(Quad.get(textBox));
                        textBox = (Rectangle2D) textCharBox.clone();
                    } else {
                        textBox.add(textCharBox);
                    }
                }

            }
            // Add the box of the last (or only) line once, after all characters are merged.
            if (textBox != null)
                highlightQuads.add(Quad.get(textBox));

            new TextMarkup(page, highlightQuads, match.tag, MarkupTypeEnum.Highlight);
        }
    }
}
[...]

(ComplexHighlight test testMarkLikeSeshadriImproved)

making use of these helper methods / classes:

static void removeOverlaps(List<Match> matches) {
    Collections.sort(matches, ComplexHighlight::compareLowLengthTag);

    for (int i = 0; i < matches.size() - 1; i++) {
        Interval<Integer> intervalI = matches.get(i).interval;
        for (int j = i + 1; j < matches.size(); j++) {
            Interval<Integer> intervalJ = matches.get(j).interval;
            if (intervalI.getLow() < intervalJ.getHigh() && intervalJ.getLow() < intervalI.getHigh()) {
                System.out.printf("Match %d removed as it overlaps match %d.\n", j, i);
                matches.remove(j--);
            }
        }
    }
}

(ComplexHighlight method removeOverlaps)

static int compareLowLengthTag(Match a, Match b) {
    int compare = a.interval.getLow().compareTo(b.interval.getLow());
    if (compare == 0)
        compare = - a.interval.getHigh().compareTo(b.interval.getHigh());
    if (compare == 0)
        compare = a.tag.compareTo(b.tag);
    return compare;
}

(ComplexHighlight method compareLowLengthTag)

class Match {
    final Interval<Integer> interval;
    final ITextString match;
    final String tag;

    public Match(final Interval<Integer> interval, final ITextString match, final String tag) {
        this.interval = interval;
        this.match = match;
        this.tag = tag;
    }
}

(Match class)

As you see, the matches here are not immediately added as highlights but instead collected in a list, matches. This list is then processed so that it no longer contains overlaps, and only the elements of the remaining, overlap-free list are added as highlights.
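The collect-then-filter idea can be tried out without PDF Clown. The following is only a minimal sketch under stated assumptions: Span is a hypothetical stand-in for the Match class, findAll uses a plain case-sensitive substring search instead of the regex and TextExtractor.filter, and withoutOverlaps builds a new list rather than removing elements in place. The ordering and the overlap test mirror compareLowLengthTag and removeOverlaps.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSketch {
    /** Hypothetical stand-in for Match: a half-open character range [low, high) plus a tag. */
    public static final class Span {
        public final int low;
        public final int high;
        public final String tag;

        public Span(int low, int high, String tag) {
            this.low = low;
            this.high = high;
            this.tag = tag;
        }
    }

    /** Collects every case-sensitive occurrence of each keyword (plain substring search). */
    public static List<Span> findAll(String text, String... keywords) {
        List<Span> spans = new ArrayList<>();
        for (String keyword : keywords) {
            if (keyword.isEmpty())
                continue; // an empty keyword would match everywhere
            for (int at = text.indexOf(keyword); at >= 0; at = text.indexOf(keyword, at + 1)) {
                spans.add(new Span(at, at + keyword.length(), keyword));
            }
        }
        return spans;
    }

    /** Earlier start first; for equal starts the longer span first; then by tag. */
    public static int compareLowLengthTag(Span a, Span b) {
        int compare = Integer.compare(a.low, b.low);
        if (compare == 0)
            compare = -Integer.compare(a.high, b.high);
        if (compare == 0)
            compare = a.tag.compareTo(b.tag);
        return compare;
    }

    /** Keeps a span only if it overlaps none of the spans kept before it in sort order. */
    public static List<Span> withoutOverlaps(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(OverlapSketch::compareLowLengthTag);
        List<Span> kept = new ArrayList<>();
        for (Span candidate : sorted) {
            boolean overlaps = false;
            for (Span other : kept) {
                if (other.low < candidate.high && candidate.low < other.high) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps)
                kept.add(candidate);
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "Just. ETS and Test. ETS, then ETS alone";
        for (Span span : withoutOverlaps(findAll(text, "ETS", "Just. ETS", "Test. ETS"))) {
            System.out.println(span.tag + " [" + span.low + ", " + span.high + ")");
        }
    }
}
```

For this sample text the two ETS hits inside Just. ETS and Test. ETS are dropped, and only the standalone ETS near the end survives next to the two longer keywords.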

As also mentioned in comments, one has to decide on priorities among the matches.

E.g. in case of the search terms "AB" and "BCD" and the document text "ABCD", the comparison method compareLowLengthTag used above always prefers the AB match, while the following comparison method compareLengthLowTag prefers the longer match BCD and only in case of equal lengths resorts to preferring the match that starts earlier:

static int compareLengthLowTag(Match a, Match b) {
    int aLength = a.interval.getHigh() - a.interval.getLow();
    int bLength = b.interval.getHigh() - b.interval.getLow();
    int compare = - Integer.compare(aLength, bLength);
    if (compare == 0)
        compare = a.interval.getLow().compareTo(b.interval.getLow());
    if (compare == 0)
        compare = a.tag.compareTo(b.tag);
    return compare;
}

(ComplexHighlight method compareLengthLowTag)
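To see the effect of the two orderings side by side, here is a small standalone sketch of the "AB" / "BCD" example. It does not use PDF Clown: Span and survivors are hypothetical names introduced only for this illustration, and the ranges are written out by hand as half-open [low, high) intervals over the text "ABCD".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PriorityDemo {
    /** Hypothetical stand-in for Match: a half-open character range [low, high) plus a tag. */
    public static final class Span {
        public final int low;
        public final int high;
        public final String tag;

        public Span(int low, int high, String tag) {
            this.low = low;
            this.high = high;
            this.tag = tag;
        }
    }

    /** Earlier start wins; for equal starts the longer span wins. */
    public static int compareLowLengthTag(Span a, Span b) {
        int compare = Integer.compare(a.low, b.low);
        if (compare == 0)
            compare = -Integer.compare(a.high, b.high);
        if (compare == 0)
            compare = a.tag.compareTo(b.tag);
        return compare;
    }

    /** Longer span wins; for equal lengths the earlier start wins. */
    public static int compareLengthLowTag(Span a, Span b) {
        int compare = -Integer.compare(a.high - a.low, b.high - b.low);
        if (compare == 0)
            compare = Integer.compare(a.low, b.low);
        if (compare == 0)
            compare = a.tag.compareTo(b.tag);
        return compare;
    }

    /** Sorts by the given priority, then keeps each span that overlaps no previously kept span. */
    public static List<String> survivors(List<Span> spans, Comparator<Span> priority) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(priority);
        List<Span> kept = new ArrayList<>();
        List<String> tags = new ArrayList<>();
        for (Span candidate : sorted) {
            boolean overlaps = false;
            for (Span other : kept) {
                if (other.low < candidate.high && candidate.low < other.high) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(candidate);
                tags.add(candidate.tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Text "ABCD": "AB" covers [0, 2), "BCD" covers [1, 4); they share the character B.
        List<Span> spans = Arrays.asList(new Span(0, 2, "AB"), new Span(1, 4, "BCD"));
        System.out.println(survivors(spans, PriorityDemo::compareLowLengthTag)); // prints [AB]
        System.out.println(survivors(spans, PriorityDemo::compareLengthLowTag)); // prints [BCD]
    }
}
```

Which ordering fits depends on the requirement. For the keywords in the question both orderings keep Just. ETS and Test. ETS over ETS, because the longer keywords also start earlier than the ETS hit they contain.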
