
Extracting an embedded XML file from a PDF/A-3 using abcpdf in C# - ZUGFeRD

I'm currently working with the new German ZUGFeRD files. These are PDF/A-3 files that have an embedded XML file containing the invoice data.

I want to extract this XML file from the PDF/A-3 using abcpdf 8.1 with C#.

Any idea how to do this?

Thanks a lot and regards,

I don't know abcpdf, but I guess that PDF libraries offer similar access to a PDF's contents.

First take a look at Das-ZUGFeRD-Format_1p0.pdf, especially page 112. The image there shows the object tree you have to traverse in order to find the XML stream.

This tree gives you the names, the types and the direction of the references. With that you can traverse the PDF object tree to get to the XML content you are looking for.

The steps, based on the diagram:

  1. Read your PDF
  2. Get the catalog inside your PDF
  3. Get the array named AF from the catalog
  4. Get the first element of the AF array (it should be a file spec)
  5. From the file spec, get the dictionary named EF
  6. From EF, get the embedded file stream (its F entry) and read the stream content
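
The key path from the steps above can be illustrated with a small sketch. It does not read a real PDF and uses no PDF library: the object tree is a hand-built, hypothetical stand-in (a dictionary is a `Dictionary<string, object>`, an array is a `List<object>`, a stream is its decoded `byte[]`), and the class and method names are made up. In a real file the AF entries are usually indirect references that the library resolves for you, and the stream itself hangs under the F key of the EF dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class AfPathSketch
{
    // Hypothetical stand-in for a parsed PDF: a dictionary is a Dictionary<string, object>,
    // an array is a List<object>, and a stream is just its decoded byte[].
    public static byte[] FindEmbeddedXml(Dictionary<string, object> catalog)
    {
        var af = (List<object>)catalog["AF"];                 // step 3: AF array of the catalog
        var fileSpec = (Dictionary<string, object>)af[0];     // step 4: first entry, the file spec
        var ef = (Dictionary<string, object>)fileSpec["EF"];  // step 5: EF dictionary
        return (byte[])ef["F"];                               // step 6: stream stored under key F
    }

    static void Main()
    {
        // Hand-built sample tree; the XML payload is a placeholder, not a real invoice.
        var ef = new Dictionary<string, object> { { "F", Encoding.UTF8.GetBytes("<invoice/>") } };
        var fileSpec = new Dictionary<string, object> { { "Type", "Filespec" }, { "EF", ef } };
        var catalog = new Dictionary<string, object> { { "AF", new List<object> { fileSpec } } };

        Console.WriteLine(Encoding.UTF8.GetString(FindEmbeddedXml(catalog)));
    }
}
```

With an actual PDF library, each cast would be replaced by that library's own accessor for dictionaries, arrays and streams.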

These are the steps you need to perform in order to get to the content.

To display the structure of a PDF and browse the tree, I would recommend a tool like iText RUPS.

What I did with abcpdf:

  • Get the ObjectSoup array from the Doc (pretty much an array of all objects in the Doc)

  • As ZUGFeRD allows only one embedded file inside the PDF, I just searched this ObjectSoup array for the one object of type StreamObject that contains /EmbeddedFile

  • Decompress the stream of that object, get the byte[] of the stream and write it into an XML file
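
The same idea can be approximated without any abcpdf-specific calls. What follows is a minimal, library-free sketch that uses only the .NET base class library, not the abcpdf object model: it scans the raw bytes for a stream whose dictionary names /EmbeddedFile and inflates it when that dictionary mentions /FlateDecode. The class and method names are made up for illustration. It assumes an unencrypted file, at most a single Flate filter on the stream, and that the byte sequence `endstream` does not occur inside the stream data.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class EmbeddedFileScanSketch
{
    // ISO-8859-1 maps every byte to exactly one char, so string offsets equal byte offsets.
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Returns the decoded bytes of the first /EmbeddedFile stream, or null if none is found.
    public static byte[] ExtractFirstEmbeddedFile(byte[] pdf)
    {
        string text = Latin1.GetString(pdf);
        const string marker = "/EmbeddedFile";
        int pos = 0;
        while ((pos = text.IndexOf(marker, pos, StringComparison.Ordinal)) >= 0)
        {
            int after = pos + marker.Length;
            // Skip longer names such as the /EmbeddedFiles key of the name tree.
            if (after < text.Length && char.IsLetterOrDigit(text[after])) { pos = after; continue; }

            int keyword = text.IndexOf("stream", after, StringComparison.Ordinal);
            if (keyword < 0) return null;
            // If the object ends before any stream keyword, this match was not a stream dictionary.
            int nextEndObj = text.IndexOf("endobj", after, StringComparison.Ordinal);
            if (nextEndObj >= 0 && nextEndObj < keyword) { pos = after; continue; }

            // The stream data starts after the end-of-line that follows the keyword.
            int dataStart = keyword + "stream".Length;
            if (dataStart < text.Length && text[dataStart] == '\r') dataStart++;
            if (dataStart < text.Length && text[dataStart] == '\n') dataStart++;
            int dataEnd = text.IndexOf("endstream", dataStart, StringComparison.Ordinal);
            if (dataEnd < 0) return null;
            // Drop the end-of-line that precedes endstream.
            if (dataEnd > dataStart && text[dataEnd - 1] == '\n') dataEnd--;
            if (dataEnd > dataStart && text[dataEnd - 1] == '\r') dataEnd--;

            byte[] raw = new byte[dataEnd - dataStart];
            Array.Copy(pdf, dataStart, raw, 0, raw.Length);

            // Look at the object's dictionary (text since the previous endobj) for the filter name.
            int prevEndObj = text.LastIndexOf("endobj", pos, StringComparison.Ordinal);
            int headerStart = prevEndObj < 0 ? 0 : prevEndObj;
            string header = text.Substring(headerStart, keyword - headerStart);
            return header.Contains("/FlateDecode") && raw.Length > 2 ? Inflate(raw) : raw;
        }
        return null;
    }

    static byte[] Inflate(byte[] zlibData)
    {
        // A zlib stream is a 2-byte header, raw deflate data, then a 4-byte checksum.
        // DeflateStream expects only the raw deflate part, so the header is skipped.
        using (var input = new MemoryStream(zlibData, 2, zlibData.Length - 2))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }

    static void Main(string[] args)
    {
        if (args.Length < 2) { Console.Error.WriteLine("usage: <input.pdf> <output.xml>"); return; }
        byte[] xml = ExtractFirstEmbeddedFile(File.ReadAllBytes(args[0]));
        if (xml == null) { Console.Error.WriteLine("No embedded file stream found."); return; }
        File.WriteAllBytes(args[1], xml);
    }
}
```

With abcpdf itself, the ObjectSoup and StreamObject mentioned in the bullets take the place of the manual scanning and inflating; the sketch only mirrors the order of the steps.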
