
Text Highlighting with PDFClown without using PDF Annotations

I started using PDFClown a few weeks ago. My goal is multi-word highlighting, mainly in newspapers. Starting from the org.pdfclown.samples.cli.TextHighlightSample example, I succeeded in extracting the positions of multiple words and highlighting them. I even solved some problems caused by text ordering and matching in most cases.

Unfortunately, my framework includes FPDI, which does not consider PDF annotations. So all content outside of the page content stream, such as text annotations and other so-called markup annotations, gets lost.

So, any suggestions on creating "text highlighting" with PdfClown without using PDF annotations?

To have the highlight in the actual page content stream instead of in an annotation, one has to put the graphics commands into the page content stream; in the org.pdfclown.samples.cli.TextHighlightSample example they are implicitly put into the normal annotation appearance stream.

This can be implemented like this:

org.pdfclown.files.File file = new org.pdfclown.files.File(resource);
Pattern pattern = Pattern.compile("S", Pattern.CASE_INSENSITIVE);
TextExtractor textExtractor = new TextExtractor(true, true);

for (final Page page : file.getDocument().getPages())
{
    final List<Quad> highlightQuads = new ArrayList<Quad>();

    Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
    final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));

    textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter()
    {
        @Override
        public boolean hasNext()
        {
            return matcher.find();
        }

        @Override
        public Interval<Integer> next()
        {
            return new Interval<Integer>(matcher.start(), matcher.end());
        }

        @Override
        public void process(Interval<Integer> interval, ITextString match)
        {
            {
                Rectangle2D textBox = null;
                for (TextChar textChar : match.getTextChars())
                {
                    Rectangle2D textCharBox = textChar.getBox();
                    if (textBox == null)
                    {
                        textBox = (Rectangle2D) textCharBox.clone();
                    }
                    else
                    {
                        if (textCharBox.getY() > textBox.getMaxY())
                        {
                            highlightQuads.add(Quad.get(textBox));
                            textBox = (Rectangle2D) textCharBox.clone();
                        }
                        else
                        {
                            textBox.add(textCharBox);
                        }
                    }
                }
                highlightQuads.add(Quad.get(textBox));
            }
        }

        @Override
        public void remove()
        {
            throw new UnsupportedOperationException();
        }
    });

    // Highlight the text pattern match!
    ExtGState defaultExtGState = new ExtGState(file.getDocument());
    defaultExtGState.setAlphaShape(false);
    defaultExtGState.setBlendMode(Arrays.asList(BlendModeEnum.Multiply));

    PrimitiveComposer composer = new PrimitiveComposer(page);
    composer.getScanner().moveEnd();
    // TODO: reset graphics state here.
    composer.applyState(defaultExtGState);
    composer.setFillColor(new DeviceRGBColor(1, 1, 0));
    {
        for (Quad markupBox : highlightQuads)
        {
            Point2D[] points = markupBox.getPoints();
            double markupBoxHeight = points[3].getY() - points[0].getY();
            double markupBoxMargin = markupBoxHeight * .25;
            composer.drawCurve(new Point2D.Double(points[3].getX(), points[3].getY()),
                    new Point2D.Double(points[0].getX(), points[0].getY()),
                    new Point2D.Double(points[3].getX() - markupBoxMargin, points[3].getY() - markupBoxMargin),
                    new Point2D.Double(points[0].getX() - markupBoxMargin, points[0].getY() + markupBoxMargin));
            composer.drawLine(new Point2D.Double(points[1].getX(), points[1].getY()));
            composer.drawCurve(new Point2D.Double(points[2].getX(), points[2].getY()),
                    new Point2D.Double(points[1].getX() + markupBoxMargin, points[1].getY() + markupBoxMargin),
                    new Point2D.Double(points[2].getX() + markupBoxMargin, points[2].getY() - markupBoxMargin));
            composer.fill();
        }
    }
    composer.flush();
}

file.save(new File(RESULT_FOLDER, "multiPage-highlight-content.pdf"), SerializationModeEnum.Incremental);

(HighlightInContent.java, method testHighlightInContent)
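The listing switches the blend mode to Multiply before filling the yellow shapes. As a rough illustration of why that keeps the underlying text readable, here is a minimal sketch of the Multiply blend function (result = backdrop × source per colour component, as defined in the PDF specification) for fully opaque colours. It uses only the JDK and not PDF Clown; the class and method names are made up for this example.

```java
import java.util.Arrays;

public class MultiplyBlendSketch {
    /** Multiply blend per colour component: result = backdrop * source, components in [0, 1]. */
    static double[] multiply(double[] backdrop, double[] source) {
        double[] result = new double[backdrop.length];
        for (int i = 0; i < backdrop.length; i++) {
            result[i] = backdrop[i] * source[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] yellow = {1, 1, 0}; // the fill colour used in the listing above

        // White paper under the highlight turns yellow.
        System.out.println(Arrays.toString(multiply(new double[] {1, 1, 1}, yellow))); // [1.0, 1.0, 0.0]

        // Black text under the highlight stays black.
        System.out.println(Arrays.toString(multiply(new double[] {0, 0, 0}, yellow))); // [0.0, 0.0, 0.0]
    }
}
```

With the default Normal blend mode, an opaque yellow fill drawn after the text would simply cover it, which is presumably why the annotation appearance code this was borrowed from uses Multiply.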

You will recognize the text extraction frame from the original example. The differences are that the quads of a whole page are now collected before they are processed, and that the processing code (which has mostly been borrowed from TextMarkup.refreshAppearance()) draws shapes representing the quads into the page content.
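The quad collection step can be looked at in isolation. The following sketch reproduces the line-grouping logic of process() with plain java.awt.geom rectangles instead of PDF Clown types: character boxes are merged until a character starts below the current box, at which point a new line box begins. The class and method names are hypothetical, the y axis is assumed to grow downwards (as the comparison in the listing implies), and the null check at the end is an addition that the listing does not have.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineBoxGrouping {
    /**
     * Merges character boxes (given in reading order) into one box per text line.
     * A new line starts when a character's top edge lies below the bottom edge
     * of the box collected so far (y axis assumed to grow downwards).
     */
    static List<Rectangle2D> groupIntoLines(List<Rectangle2D> charBoxes) {
        List<Rectangle2D> lines = new ArrayList<>();
        Rectangle2D current = null;
        for (Rectangle2D charBox : charBoxes) {
            if (current == null) {
                current = (Rectangle2D) charBox.clone();
            } else if (charBox.getY() > current.getMaxY()) {
                lines.add(current);                       // finish the previous line
                current = (Rectangle2D) charBox.clone();  // start a new one
            } else {
                current.add(charBox);                     // grow the current line box
            }
        }
        if (current != null) {
            lines.add(current); // guard against matches without characters
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Rectangle2D> chars = Arrays.asList(
                new Rectangle2D.Double(10, 10, 5, 10),
                new Rectangle2D.Double(15, 10, 5, 10),
                new Rectangle2D.Double(10, 25, 5, 10)); // starts below the first two
        for (Rectangle2D line : groupIntoLines(chars)) {
            System.out.println(line);
        }
    }
}
```

For a match that wraps across two lines this yields two boxes, which is what lets each line get its own highlight shape instead of one rectangle spanning both.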

Beware: to make this work generically, the graphics state has to be reset before inserting the new instructions (the position is marked with a TODO comment). This can be done either by applying save/restore state or by actually counteracting unwanted changed state entries. Unfortunately, I did not see how to do the former in PDF Clown and have not yet had the time to do the latter.
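At the level of raw content stream operators, the save/restore variant amounts to wrapping the existing page content in q ... Q and appending the highlight instructions afterwards. The sketch below only builds the operator text and does not use the PDF Clown API at all; the class name, the method name and the ExtGState resource name are hypothetical. It assumes that the existing content is balanced with respect to q/Q, that the boxes are already in default PDF user space (y axis upwards), and that an ExtGState with Multiply blend mode is registered in the page resources under the given name.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class IsolatedHighlightOps {
    /**
     * Returns content stream text in which the existing content is enclosed in q ... Q,
     * so that any graphics state changes it makes (CTM, colours, clipping) are discarded,
     * followed by yellow rectangle fills in their own q ... Q pair.
     */
    static String appendHighlights(String existingContent, List<Rectangle2D> boxes, String extGStateName) {
        if (boxes.isEmpty()) {
            return existingContent; // nothing to highlight, leave the page untouched
        }
        StringBuilder out = new StringBuilder();
        out.append("q\n").append(existingContent).append("\nQ\n"); // isolate the old content
        out.append("q\n");
        out.append('/').append(extGStateName).append(" gs\n");     // apply the (assumed) ExtGState
        out.append("1 1 0 rg\n");                                  // non-stroking colour: yellow
        for (Rectangle2D box : boxes) {
            out.append(String.format(Locale.ROOT, "%.2f %.2f %.2f %.2f re\n",
                    box.getX(), box.getY(), box.getWidth(), box.getHeight()));
        }
        out.append("f\nQ\n");                                      // fill all rectangles, restore state
        return out.toString();
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = Arrays.asList(new Rectangle2D.Double(10, 20, 30, 5));
        System.out.print(appendHighlights("BT ET", boxes, "GS1"));
    }
}
```

Plain rectangles are used here instead of the curved shapes from the listing to keep the operator sequence short. Whether and how the same wrapping can be expressed through PDF Clown's own composer classes is not covered by this sketch.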
