PDFClown: Different font size in one line
I'm using PDFClown to analyze a PDF document. In many documents, PDFClown reports different heights for some characters even though they obviously have the same height. Is there a workaround?
This is the code:
while(_level.moveNext()) {
    ContentObject content = _level.getCurrent();
    if(content instanceof Text) {
        ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)_level.getCurrentWrapper();
        for(ContentScanner.TextStringWrapper textString : text.getTextStrings()) {
            List<CharInfo> chars = new ArrayList<>();
            for(TextChar textChar : textString.getTextChars()) {
                chars.add(new CharInfo(textChar.getBox(), textChar.getValue()));
            }
        }
    }
    else if(content instanceof XObject) {
        // Scan the external level
        if(((XObject)content).getScanner(_level)!=null){
            getContentLines(((XObject)content).getScanner(_level));
        }
    }
    else if(content instanceof ContainerObject){
        // Scan the inner level
        if(_level.getChildLevel()!=null){
            getContentLines(_level.getChildLevel());
        }
    }
}
Here is an example PDF document:
In this document I marked two text chunks which both contain the word "million". When analyzing the size of each char in both occurrences of "million", the following happens:
Even though all chars of the two text chunks obviously have the same size, PDF Clown reports that the sizes are different.
The issue is caused by a bug in PDF Clown: it assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap. I.e. it assumes that these structures only intermingle as
begin-marked-content
save-graphics-state
restore-graphics-state
end-marked-content
or
save-graphics-state
begin-marked-content
end-marked-content
restore-graphics-state
but never as
save-graphics-state
begin-marked-content
restore-graphics-state
end-marked-content
or
begin-marked-content
save-graphics-state
end-marked-content
restore-graphics-state.
Unfortunately this assumption is wrong: marked content sections and save/restore graphics state blocks can intermingle any way they like.
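To check whether a given content stream violates the strict-nesting assumption, a small stand-alone checker can help. The following is only a sketch: it works on a plain list of operator names that you would have to extract from the content stream yourself, and the class and method names (NestingCheck, isStrictlyNested) are made up for illustration and are not part of PDF Clown.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class NestingCheck {
    /**
     * Returns true if every Q closes a block opened by q and every EMC closes
     * a block opened by BMC/BDC, i.e. the two kinds of blocks never overlap.
     * Blocks still open at the end of the list are not reported.
     */
    static boolean isStrictlyNested(List<String> operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "q":
                case "BMC":
                case "BDC":
                    open.push(op);
                    break;
                case "Q":
                    // The innermost open block must have been opened by q.
                    if (open.isEmpty() || !open.pop().equals("q")) return false;
                    break;
                case "EMC":
                    // The innermost open block must be a marked content block.
                    if (open.isEmpty() || open.pop().equals("q")) return false;
                    break;
                default:
                    break; // other operators neither open nor close a block
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Properly nested: prints true
        System.out.println(isStrictlyNested(Arrays.asList("BDC", "q", "Q", "EMC")));
        // Overlapping, as in the problematic document: prints false
        System.out.println(isStrictlyNested(Arrays.asList("q", "BDC", "Q", "EMC")));
    }
}
```

A result of false for a page means its content stream is of the kind that triggers the problem described here.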
E.g. in the document at hand there are sequences like this:
q
[...1...]
/P <</MCID 0 >>BDC
Q
[...2...]
EMC
Here [...1...] is contained in the save/restore graphics state block enveloped by q and Q, and [...2...] is contained in the marked content block enveloped by /P <</MCID 0 >>BDC and EMC.
Due to the wrong assumption, though, and the way /P <</MCID 0 >>BDC and Q are arranged, PDF Clown parses the above as a single save/restore graphics state block containing [...1...], an empty marked content block, and [...2...].
Thus, if there are changes in the graphics state inside [...2...], PDF Clown assumes them to be limited to the lines above, while they actually remain in effect afterwards.
The only easy way I found to repair this was to disable the marked content parsing in PDF Clown.
To do this I changed org.pdfclown.documents.contents.tokens.ContentParser as follows:
In parseContentObjects() I disabled the contentObject instanceof EndMarkedContent option:
public List<ContentObject> parseContentObjects(
  )
{
  final List<ContentObject> contentObjects = new ArrayList<ContentObject>();
  while(moveNext())
  {
    ContentObject contentObject = parseContentObject();
    // Multiple-operation graphics object end?
    if(contentObject instanceof EndText // Text.
      || contentObject instanceof RestoreGraphicsState // Local graphics state.
      /* || contentObject instanceof EndMarkedContent // End marked-content sequence. */
      || contentObject instanceof EndInlineImage) // Inline image.
      return contentObjects;

    contentObjects.add(contentObject);
  }
  return contentObjects;
}
In parseContentObject() I removed the if(operation instanceof BeginMarkedContent) branch:
public ContentObject parseContentObject(
  )
{
  final Operation operation = parseOperation();
  if(operation instanceof PaintXObject) // External object.
    return new XObject((PaintXObject)operation);
  else if(operation instanceof PaintShading) // Shading.
    return new Shading((PaintShading)operation);
  else if(operation instanceof BeginSubpath
    || operation instanceof DrawRectangle) // Path.
    return parsePath(operation);
  else if(operation instanceof BeginText) // Text.
    return new Text(
      parseContentObjects()
      );
  else if(operation instanceof SaveGraphicsState) // Local graphics state.
    return new LocalGraphicsState(
      parseContentObjects()
      );
  /* else if(operation instanceof BeginMarkedContent) // Marked-content sequence.
    return new MarkedContent(
      (BeginMarkedContent)operation,
      parseContentObjects()
      ); */
  else if(operation instanceof BeginInlineImage) // Inline image.
    return parseInlineImage();
  else // Single operation.
    return operation;
}
With these changes in place, the character sizes are properly extracted.
As an aside, while the returned individual character boxes seem to imply that the box is completely custom to the character in question, that is not true: only the width of the box is character specific. The height is calculated from overall font properties (and the current font size) and not from the specific character, cf. the org.pdfclown.documents.contents.fonts.Font method getHeight(char):
/**
  Gets the unscaled height of the given character.

  @param textChar
    Character whose height has to be calculated.
*/
public final double getHeight(
  char textChar
  )
{
  /*
    TODO: Calculate actual text height through glyph bounding box.
  */
  if(textHeight == -1)
  {textHeight = getAscent() - getDescent();}
  return textHeight;
}
Individual character height calculation still is a TODO.
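The consequence for the extracted boxes can be illustrated with a tiny stand-alone sketch. It assumes that ascent and descent are given in thousandths of the font size (common for PDF font metrics) and that the descent is negative. The class, the method name and the sample metric values are hypothetical and not taken from PDF Clown.

```java
class CharBoxHeight {
    /**
     * Box height for a character at the given font size. The character is
     * deliberately ignored: only the font-wide metrics and the size matter.
     */
    static double boxHeight(char textChar, double ascent, double descent, double fontSize) {
        return (ascent - descent) * fontSize / 1000.0;
    }

    public static void main(String[] args) {
        double ascent = 718;   // hypothetical font-wide ascent
        double descent = -207; // hypothetical font-wide descent
        // Same font and size: a short and a tall glyph get the same box height.
        System.out.println(boxHeight('i', ascent, descent, 12) == boxHeight('M', ascent, descent, 12));
        // A different font size changes the height for every character alike.
        System.out.println(boxHeight('i', ascent, descent, 12) < boxHeight('i', ascent, descent, 14));
    }
}
```

So under these assumptions, two characters drawn with the same font should only get different box heights if the font size the scanner believes to be in effect differs, which is what happens when the graphics state is tracked incorrectly.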