PDFClown shows different font sizes in one line

Question

I'm using PDFClown to analyze a PDF document. In many documents, PDFClown reports different heights for some characters even though they obviously have the same height. Is there a workaround?

This is the code:

    while(_level.moveNext()) {
        ContentObject content = _level.getCurrent();
        if(content instanceof Text) {
            ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)_level.getCurrentWrapper();
            for(ContentScanner.TextStringWrapper textString : text.getTextStrings()) {
                List<CharInfo> chars = new ArrayList<>();
                for(TextChar textChar : textString.getTextChars()) {
                    chars.add(new CharInfo(textChar.getBox(), textChar.getValue()));
                }
            }
        }
        else if(content instanceof XObject) {
            // Scan the external level
            if(((XObject)content).getScanner(_level)!=null){
                getContentLines(((XObject)content).getScanner(_level));
            }
        }
        else if(content instanceof ContainerObject){
            // Scan the inner level
            if(_level.getChildLevel()!=null){
                getContentLines(_level.getChildLevel());
            }
        }
    }

Here is an example PDF document:

Example

In this document I marked two text chunks which both contain the word "million". When analyzing the size of each char in both occurrences of "million", the following happens:

"m" in the first mark has the height: 14,50 and the width: 8,5
"i" in the first mark has the height: 14,50 and the width: 3,0
"l" in the first mark has the height: 14,50 and the width: 3,0
"m" in the second mark has the height: 10,56 and the width: 6,255
"i" in the second mark has the height: 10,56 and the width: 2,23
"l" in the second mark has the height: 10,56 and the width: 2,23

Even though all chars of the two text chunks obviously have the same size, PDF Clown reports different sizes.

Answer 1

The issue is caused by a bug in PDF Clown: it assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap. I.e. it assumes that these structures only intermingle as

begin-marked-content
save-graphics-state
restore-graphics-state
end-marked-content

or

save-graphics-state
begin-marked-content
end-marked-content
restore-graphics-state

but never as

save-graphics-state
begin-marked-content
restore-graphics-state
end-marked-content

or

begin-marked-content
save-graphics-state
end-marked-content
restore-graphics-state.

Unfortunately this assumption is wrong: marked content sections and save/restore graphics state blocks can intermingle any way they like.

E.g. in the document at hand there are sequences like this:

q
[...1...]
/P <</MCID 0 >>BDC 
Q
[...2...]
EMC

Here [...1...] is contained in the save/restore graphics state block enveloped by q and Q, and [...2...] is contained in the marked content block enveloped by /P <</MCID 0 >>BDC and EMC.

Due to the wrong assumption, though, and the way /P <</MCID 0 >>BDC and Q are arranged, PDF Clown parses the above as a single save/restore graphics state block containing [...1...], an empty marked content block, and [...2...].

Thus, if there are changes in the graphics state inside [...2...], PDF Clown assumes they are limited to that block and undone at its end, while they actually remain in effect afterwards.
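The mis-grouping can be reproduced without PDF Clown. The following is a minimal sketch, not PDF Clown's actual code: the class `NestingDemo`, its methods and the simplified token strings are hypothetical. It only imitates the strategy described above, in which any end operator closes whichever block is currently open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NestingDemo {
    /** Imitates a parser that lets ANY end operator close the innermost open block. */
    public static List<Object> parse(Iterator<String> tokens) {
        List<Object> objects = new ArrayList<>();
        while (tokens.hasNext()) {
            String token = tokens.next();
            if (token.equals("Q") || token.equals("EMC")) {
                // Closes the current block, no matter which operator opened it.
                return objects;
            } else if (token.equals("q")) {
                objects.add(block("LocalGraphicsState", parse(tokens)));
            } else if (token.equals("BDC")) {
                objects.add(block("MarkedContent", parse(tokens)));
            } else {
                objects.add(token);
            }
        }
        return objects;
    }

    /** Renders a block as its name followed by its content list. */
    public static String block(String name, List<Object> content) {
        return name + content;
    }

    public static void main(String[] args) {
        // Simplified form of the sequence from the document at hand.
        List<String> stream = Arrays.asList("q", "[1]", "BDC", "Q", "[2]", "EMC");
        System.out.println(parse(stream.iterator()));
    }
}
```

Running `main` prints `[LocalGraphicsState[[1], MarkedContent[], [2]]]`: the marked content block comes out empty and `[2]` ends up inside the save/restore block, although in the stream it follows `Q`.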

The only easy way I found to repair this was to disable the marked content parsing in PDF Clown.

To do this I changed org.pdfclown.documents.contents.tokens.ContentParser as follows:

In parseContentObjects() I disabled the contentObject instanceof EndMarkedContent option:

  public List<ContentObject> parseContentObjects(
    )
  {
    final List<ContentObject> contentObjects = new ArrayList<ContentObject>();
    while(moveNext())
    {
      ContentObject contentObject = parseContentObject();
      // Multiple-operation graphics object end?
      if(contentObject instanceof EndText // Text.
        || contentObject instanceof RestoreGraphicsState // Local graphics state.
        /* || contentObject instanceof EndMarkedContent // End marked-content sequence. */
        || contentObject instanceof EndInlineImage) // Inline image.
        return contentObjects;

      contentObjects.add(contentObject);
    }
    return contentObjects;
  }

In parseContentObject() I removed the if(operation instanceof BeginMarkedContent) branch:

  public ContentObject parseContentObject(
    )
  {
    final Operation operation = parseOperation();
    if(operation instanceof PaintXObject) // External object.
      return new XObject((PaintXObject)operation);
    else if(operation instanceof PaintShading) // Shading.
      return new Shading((PaintShading)operation);
    else if(operation instanceof BeginSubpath
      || operation instanceof DrawRectangle) // Path.
      return parsePath(operation);
    else if(operation instanceof BeginText) // Text.
      return new Text(
        parseContentObjects()
        );
    else if(operation instanceof SaveGraphicsState) // Local graphics state.
      return new LocalGraphicsState(
        parseContentObjects()
        );
    /* else if(operation instanceof BeginMarkedContent) // Marked-content sequence.
      return new MarkedContent(
        (BeginMarkedContent)operation,
        parseContentObjects()
        ); */
    else if(operation instanceof BeginInlineImage) // Inline image.
      return parseInlineImage();
    else // Single operation.
      return operation;
  }

With these changes in place, the character sizes are properly extracted.

As an aside, while the returned individual character boxes seem to imply that each box is custom-fitted to the character in question, that is not true: only the width of the box is character-specific. The height is calculated from overall font properties (and the current font size), not from the individual character; cf. the org.pdfclown.documents.contents.fonts.Font method getHeight(char):

  /**
    Gets the unscaled height of the given character.

    @param textChar
      Character whose height has to be calculated.
  */
  public final double getHeight(
    char textChar
    )
  {
    /*
      TODO: Calculate actual text height through glyph bounding box.
    */
    if(textHeight == -1)
    {textHeight = getAscent() - getDescent();}
    return textHeight;
  }

Individual character height calculation is still a TODO.
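This also explains why a wrongly tracked graphics state changes every reported height in a chunk at once: the unscaled height depends only on the font, so only the font size can make a difference. A rough sketch, assuming the box height is simply the unscaled height (in 1/1000 text space units) multiplied by the font size; the class `HeightDemo`, its method and the metric values are hypothetical and not taken from the document in question:

```java
public class HeightDemo {
    /**
     * Height of any character box: font-wide ascent and descent
     * (in 1/1000 units) scaled by the current font size.
     * The character itself plays no role.
     */
    public static double charHeight(double ascent, double descent, double fontSize) {
        return (ascent - descent) * fontSize / 1000.0;
    }

    public static void main(String[] args) {
        // Hypothetical font metrics; real values come from the font descriptor.
        double ascent = 750;
        double descent = -250;

        // Same font, two different font sizes in the graphics state.
        System.out.println(charHeight(ascent, descent, 12.0));
        System.out.println(charHeight(ascent, descent, 9.0));
    }
}
```

With identical font metrics, the two calls differ only through the font size, so a parser that applies or discards a font size change at the wrong place reports different heights for text that is rendered at the same size.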
