Extract XDP or XFA from PDF

I created a PDF form with Adobe LiveCycle Designer. I'm now struggling to extract the data programmatically from the PDF after it has been filled out.

I tried to do this using poppler (the Qt4 binding, but I guess that doesn't matter), but apparently poppler can't handle XFA forms, even though evince and okular are able to display the form.

As far as I understand, the PDF contains an XDP, which in turn contains the XFA form. My question is: how can I extract that data from the PDF?

If there are libraries for this, C++, Java, Python or PHP are my options.

The XML document (in XDP format) that makes up the XFA is stored as the value of the XFA key in the AcroForm dictionary (the Interactive Form Dictionary). The AcroForm dictionary is referenced from the Catalog dictionary (the Root of the PDF document).

The XFA value can be a stream or an array of streams. If it's a stream, it contains the entire XML document. If it's an array, the different streams contain the separate XDP packets. Concatenating them in array order gives the full XML document.
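The array case can be sketched roughly as follows. This is a minimal illustration, not tied to any particular PDF library: it assumes the streams have already been read and decoded (filters such as Flate removed) by whatever library is in use, that they are passed in array order, and that the XML is UTF-8 encoded. The class name `XfaPacketJoiner` and the method `join` are hypothetical names chosen for this example. As far as I know, the array in a real PDF alternates packet names with streams, so only the stream entries would be passed in here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class XfaPacketJoiner {

    /**
     * Concatenates the decoded contents of the XFA streams, in array order,
     * into one XML string. Assumes UTF-8 encoded XML (an assumption of this sketch).
     */
    public static String join(List<byte[]> decodedStreams) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : decodedStreams) {
            // Append the bytes first and decode once at the end, so a
            // multi-byte character split across two streams is not corrupted.
            out.write(chunk, 0, chunk.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Joining at the byte level rather than decoding each stream to a string separately is a deliberate choice: packet boundaries are not guaranteed to fall on character boundaries.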

One of the XDP packets is the dataSets packet. The actual form data is in a child element of this packet, xfa:data. Example:

<xfa:dataSets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <!-- arbitrary XML data, e.g.: -->
    <Employee>
      <FirstName>John</FirstName>
      <Name>Doe</Name>
    </Employee>
  </xfa:data>
</xfa:dataSets>
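Once the dataSets packet is available as XML text, the field values can be read with the standard Java DOM API. The following is a minimal sketch under a few assumptions: the class name `XfaDataReader` and the method `readFields` are hypothetical, the packet is passed in as a string, and leaf element names are unique (a repeated name would overwrite the earlier value in the map).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XfaDataReader {

    // Namespace of the xfa:data element, as used in the example above.
    static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    /** Returns the leaf elements below xfa:data as name -> text pairs. */
    public static Map<String, String> readFields(String datasetsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look elements up by namespace
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(datasetsXml.getBytes(StandardCharsets.UTF_8)));

        Map<String, String> fields = new LinkedHashMap<>();
        NodeList dataNodes = doc.getElementsByTagNameNS(XFA_DATA_NS, "data");
        if (dataNodes.getLength() > 0) {
            collectLeaves((Element) dataNodes.item(0), fields);
        }
        return fields;
    }

    // Walks the child elements of parent; elements without element children are stored.
    private static void collectLeaves(Element parent, Map<String, String> fields) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (!(child instanceof Element)) {
                continue; // skip whitespace text nodes and comments
            }
            Element element = (Element) child;
            if (hasElementChild(element)) {
                collectLeaves(element, fields);
            } else {
                fields.put(element.getLocalName(), element.getTextContent());
            }
        }
    }

    private static boolean hasElementChild(Element element) {
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                return true;
            }
        }
        return false;
    }
}
```

For the sample packet above, `readFields` would yield the entries FirstName and Name. A real form may nest or repeat elements, in which case a tree-shaped result or XPath queries would likely be a better fit than a flat map.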

Any PDF library that offers low-level access to PDF objects can be used to extract the XML document. Simply navigate through Catalog > AcroForm > XFA.

Some PDF libraries may offer a higher-level convenience method.

(Disclaimer: I'm an iText Software employee.) For example, using iText (Java) you can do this to get the XFA as an org.w3c.dom.Document:

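import com.itextpdf.text.pdf.PdfReader; // iText 5 package names
import com.itextpdf.text.pdf.XfaForm;
// pdfFile is the path to the filled-in PDF
PdfReader reader = new PdfReader(pdfFile);
XfaForm xfa = reader.getAcroFields().getXfa();
org.w3c.dom.Document doc = xfa.getDomDocument();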

Or, to get just the dataSets packet as an org.w3c.dom.Node:

org.w3c.dom.Node datasets = xfa.getDatasetsNode();
