
PDFClown MarkerContent gives only first two ContentObjects

I am a newbie to PDF Clown and need help parsing my PDF contents.

My PDF has a huge number of marked content sections, which show up when the content is converted to a stream.

But I am not able to parse them into objects to extract the path information they contain, which is my objective.

Here is my code:

if (level.Contents[i] is MarkedContent)
{
    PdfDataObject ContentDataObj = level.Contents.BaseDataObject;
    PdfIndirectObject pdfIndirectObject = level.Contents.BaseDataObject.IndirectObject;

    PdfStream ContentStream = (PdfStream)ContentDataObj.Resolve();

    ContentParser contentParser = new ContentParser(ContentStream.GetBody(true).ToByteArray());
    IList<ContentObject> markerContentObjList = contentParser.ParseContentObjects();

    // Here I am getting only two content objects, whereas the stream has many distinct marked contents.

    for (int k = 0; k < markerContentObjList.Count; k++)
    {
    }
}

Below is the DOM Inspector screenshot and stream data:

[Screenshot: DOM Inspector view and stream data]

In Short

There are multiple errors in the content streams of your PDF, in particular errors that close more objects than are opened. This is most likely what causes parsing to stop early. Even if it is not, PDF Clown would pair the starts and ends of objects differently than intended. Thus, the only real fix is to ask the source of the documents to provide a non-broken version.

The First Content Stream

The screenshot you provided shows the first content stream of your page:

[Screenshot: first content stream]

The second content stream of that page exhibits the same issues as this one:

Non-Matching Starts and Ends of Marked Content Sequences

If we look at the marked content operators, we see:

/OC /Heading BDC
...
EMC
EMC
/OC /Heading BDC
...
EMC

As you can see, there are two EMC operators for the first BDC. This is invalid. See ISO 32000-2, section 14.6, Marked content.
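To spot such surplus EMC operators before handing a stream to a parser, a rough check along the following lines may help. This is only a sketch: the class `MarkedContentCheck` and its method are hypothetical names, not part of PDF Clown, and the code assumes the content stream has already been decoded to text and that splitting on whitespace is good enough (it ignores string literals and inline images, which a real tokenizer must handle).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of PDF Clown.
static class MarkedContentCheck
{
    // Returns the zero-based token indices of EMC operators
    // that have no open BMC/BDC left to close.
    public static List<int> FindUnmatchedEmc(string content)
    {
        var unmatched = new List<int>();
        int depth = 0; // number of currently open marked content sequences

        // Assumption: whitespace splitting approximates real tokenization.
        string[] tokens = content.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < tokens.Length; i++)
        {
            switch (tokens[i])
            {
                case "BMC":
                case "BDC":
                    depth++;
                    break;
                case "EMC":
                    if (depth == 0)
                        unmatched.Add(i); // closes something that was never opened
                    else
                        depth--;
                    break;
            }
        }
        return unmatched;
    }
}
```

For the operator sequence shown above, `MarkedContentCheck.FindUnmatchedEmc("/OC /Heading BDC EMC EMC /OC /Heading BDC EMC")` reports a single unmatched EMC at token index 4, the second EMC.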

Invalid Fill Operator

Furthermore, there is a fill operator directly following a text object:

BT
...
ET
f

This is also invalid: path painting operators are only allowed after a path object or a clipping path object, not after a text object. See ISO 32000-2, Figure 9, Graphics objects.
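A similar rough check can flag path painting operators that appear while no path is under construction. Again this is a sketch under stated assumptions: `PathPaintCheck` is a hypothetical name, the stream is assumed to be decoded text, tokens are split on whitespace only, and any path construction operator is treated as opening a path (strictly, a path has to begin with m or re).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of PDF Clown.
static class PathPaintCheck
{
    // Path construction operators.
    static readonly HashSet<string> Construct =
        new HashSet<string> { "m", "l", "c", "v", "y", "h", "re" };

    // Path painting operators, including n (end path without painting).
    static readonly HashSet<string> Paint =
        new HashSet<string> { "S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n" };

    // Returns the zero-based token indices of painting operators
    // that are not preceded by any path construction.
    public static List<int> FindPaintWithoutPath(string content)
    {
        var invalid = new List<int>();
        bool pathOpen = false;

        // Assumption: whitespace splitting approximates real tokenization.
        string[] tokens = content.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < tokens.Length; i++)
        {
            if (Construct.Contains(tokens[i]))
            {
                pathOpen = true;
            }
            else if (Paint.Contains(tokens[i]))
            {
                if (!pathOpen)
                    invalid.Add(i); // nothing to paint
                pathOpen = false;   // painting ends the current path
            }
        }
        return invalid;
    }
}
```

With the input `"BT /F1 12 Tf (Hi) Tj ET f"` the trailing f at token index 7 is reported, while `"0 0 m 10 10 l f"` yields an empty list.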

A Related PDF Clown Issue

Actually, there is a bug in PDF Clown which makes processing marked content with it impossible anyway: PDF Clown assumes that marked content sections and save/restore graphics state blocks are properly nested in each other and don't overlap; see this answer for details. This assumption is wrong and results in incorrect graphics state contents, as explained in that answer.

Thus, one should patch marked content support out of PDF Clown, as explained there, to at least get proper graphics state information. Thereafter, obviously, you cannot properly process marked content unless you add correct support for it yourself.

Why PDF Clown Stops at the End of the First Stream

As you observed, PDF Clown does not stop after the extra EMC but at the end of the first content stream.

This is due to the PDF Clown issue explained above: based on the assumption that marked content sections and save/restore graphics state blocks are properly nested in each other, PDF Clown simply makes EMC and Q close the most recently opened, still open marked content section or save/restore graphics state block, without checking whether the kinds match.

Thus, it matches opening and closing operators in your stream like this:

[Start of page content]
.  q
.  .  /OC /Heading BDC
.  .  EMC
.  EMC
.  /OC /Drawing BDC
.  EMC
Q

So for PDF Clown, that last Q does not match the initial q in the content but the start of the page content itself.

I think that PDF Clown stops parsing here because it assumes it has found the end of the page contents.
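The pairing described above can be reproduced with a small simulation. This is not PDF Clown's actual code, only a sketch of the behavior as explained here: `NaivePairing` is a hypothetical name, one stack is shared by q and BMC/BDC, every EMC or Q pops whatever is on top, and tokens are split on whitespace only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simulation, not PDF Clown's implementation.
static class NaivePairing
{
    // Pairs every EMC or Q with the most recently opened, still open
    // q/BMC/BDC without checking that the kinds match. A closer that
    // finds nothing open is paired with the pseudo-opener "[start]".
    public static List<string> Pair(string content)
    {
        var open = new Stack<string>();
        var pairs = new List<string>();

        // Assumption: whitespace splitting approximates real tokenization.
        foreach (string token in content.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            if (token == "q" || token == "BMC" || token == "BDC")
                open.Push(token);
            else if (token == "EMC" || token == "Q")
                pairs.Add((open.Count > 0 ? open.Pop() : "[start]") + " <-> " + token);
        }
        return pairs;
    }
}
```

For `"q /OC /Heading BDC EMC EMC /OC /Drawing BDC EMC Q"` the result is `BDC <-> EMC`, `q <-> EMC`, `BDC <-> EMC`, `[start] <-> Q`: the surplus EMC consumes the q, so the final Q has nothing left to close except the start of the page content. A parser that keeps separate stacks for q/Q and BMC/BDC/EMC would not mispair them this way, though the surplus EMC would still be an error in the stream.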
