How to extract XML from an XFA PDF document in Java using iText 7 (or another library)?

Question

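Using Java and iText 7, I am trying to get at the XML data in an XFA PDF form so that I can parse (and possibly modify) it, but all I manage to obtain is some basic generic data that is the same for every XFA file I use.

I know it must be possible, because the iText RUPS tool does it, but I have been going around in circles for a few days now.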

public class Parse {

    private PdfDocument pdf;
    private PdfAcroForm form;
    private XfaForm xfa;
    private Document domDocument;
    private Map<Integer, String> data;
    private int numberOfPages;
    private String pdfText;

    public void openPdf(String src, String dest) throws IOException, TransformerException {

        PdfReader reader = new PdfReader(src);
        reader.setUnethicalReading(true);
        pdf = new PdfDocument(reader, new PdfWriter(dest));
        form = PdfAcroForm.getAcroForm(pdf, true);

        data = new HashMap<Integer, String>();
        numberOfPages = getNumberOfPdfPages();
        PdfPage currentPage;
        String textFromPage;

        for (int page = 1; page <= numberOfPages; page++) {
            System.out.println("Reading page: " + page + " -----------------");
            currentPage = pdf.getPage(page);
            textFromPage = PdfTextExtractor.getTextFromPage(currentPage);
            data.put(page, textFromPage);
            pdfText += currentPage + ":" + "\n" + textFromPage + "\n";
        }


        xfa = form.getXfaForm();
        domDocument = xfa.getDomDocument();
        Map<String, Node> map = xfa.extractXFANodes(domDocument);

        System.out.println("The template node = " + map.get("template").toString() + "\n");
        System.out.println("Dom document = " + domDocument.toString() + "\n");
        System.out.println("In map form = " + map.toString() + "\n");   
        System.out.println("pdfText = " + pdfText + "\n");

        Node node = xfa.getDatasetsNode();
        NodeList list = node.getChildNodes();

        for (int i = 0; i < list.getLength(); i++) {
            System.out.println("Get Child Nodes Output = " + list.item(i) + "\n");
        }

    }
}

This is the generic output I receive.

Reading page: 1 -----------------
The template node = [template: null]

Dom document = [#document: null]

In map form = {template=[template: null], form=[form: null], xfdf=[xfdf: null], xmpmeta=[x:xmpmeta: null], datasets=[xfa:datasets: null], config=[config: null], PDFSecurity=[PDFSecurity: null]}

pdfText = nullcom.itextpdf.kernel.pdf.PdfPage@6fa38a:

> Please wait... 
> 
> If this message is not eventually replaced by the proper contents of
> the document, your PDF  viewer may not be able to display this type of
> document.     You can upgrade to the latest version of Adobe Reader
> for Windows®, Mac, or Linux® by  visiting 
> http://www.adobe.com/go/reader_download.     For more assistance with
> Adobe Reader visit  http://www.adobe.com/go/acrreader.     Windows is
> either a registered trademark or a trademark of Microsoft Corporation
> in the United States and/or other countries. Mac is a trademark  of
> Apple Inc., registered in the United States and other countries. Linux
> is the registered trademark of Linus Torvalds in the U.S. and other 
> countries.

Get Child Nodes Output = [xfa:data: null]

Answer 1

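What you have is a pure XFA file. This means that the only PDF content stored in the file is the page with the "Please wait..." message. That page is what gets displayed in PDF viewers that don't know how to render XFA.

It is also the content you get when you extract text from the page with: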

currentPage = pdf.getPage(page);
textFromPage = PdfTextExtractor.getTextFromPage(currentPage);

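That is not what you should do with a pure XFA file, because all the relevant content is stored in an XML stream inside the PDF file.

You already have the first part: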

xfa = form.getXfaForm();
domDocument = xfa.getDomDocument();
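
A side note on the output in the question: values such as [template: null] or [xfa:data: null] do not necessarily mean the nodes are empty. The format looks like the default toString() of the JDK's DOM implementation, which prints the node name and the node value, and the node value of an element is always null according to the DOM specification. The following is a minimal sketch using only the JDK (the class name NodeValueDemo and the sample XML are made up for illustration) that shows the difference between getNodeValue() and getTextContent():

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of iText.
class NodeValueDemo {

    // Parses an XML string into a DOM document, rethrowing failures unchecked.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a real XFA node.
        Element root = parse("<data><name>Alice</name></data>").getDocumentElement();
        // The node value of an element node is null by definition.
        System.out.println(root.getNodeValue());
        // getTextContent() concatenates the text of all descendants.
        System.out.println(root.getTextContent());
    }
}
```

So printing a node with toString() tells you very little; the content has to be read from its children or serialized.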

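The XFA stream can be found in the /AcroForm entry. I know this is awkward, but that is how PDF was designed. It was not our choice, and XFA has been deprecated in PDF 2.0, so XFA is dying anyway. The problem will go away once XFA is finally dead and buried.

That said, you have an instance of org.w3c.dom.Document, and you want to get the XML file that is stored in this object. You don't need iText to do that. It is explained, for instance, in "Converting a org.w3c.dom.Document in Java to String using Transformer".

I tested that approach on an XFA file using this snippet: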

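import com.itextpdf.forms.PdfAcroForm;
import com.itextpdf.forms.xfa.XfaForm;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import java.io.IOException;
import java.io.StringWriter;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public static void main(String[] args) throws IOException, TransformerException {
    // SRC is the path to the XFA PDF, defined elsewhere.
    PdfDocument pdf = new PdfDocument(new PdfReader(SRC));
    PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
    XfaForm xfa = form.getXfaForm();
    Document doc = xfa.getDomDocument();
    // From here on only the JDK is needed: serialize the DOM to a String.
    DOMSource domSource = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    StreamResult result = new StreamResult(writer);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.transform(domSource, result);
    writer.flush();
    System.out.println(writer.toString());
    pdf.close();
}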

The output on screen was the XDP XML file, containing all the XFA information I expected.
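
If only the form data is of interest rather than the complete XDP, the same Transformer technique can be applied to a single node instead of the whole document. In the iText case, the xfa.getDatasetsNode() call from the question already returns such a node. The sketch below uses only the JDK and works on an XDP string; the class name XfaDataExtractor and the sample XDP are made up for illustration, and it assumes the data lives in an element named data in the XFA data namespace:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of iText.
class XfaDataExtractor {
    static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    // Returns the serialized xfa:data element of an XDP string, or null if there is none.
    static String extractDataXml(String xdpXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for the namespace lookup below
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xdpXml)));
            NodeList hits = doc.getElementsByTagNameNS(XFA_DATA_NS, "data");
            if (hits.getLength() == 0) {
                return null;
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            // A DOMSource may wrap any node, not just a whole document.
            t.transform(new DOMSource(hits.item(0)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XDP XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up, heavily simplified XDP for demonstration purposes.
        String xdp = "<xdp:xdp xmlns:xdp=\"http://ns.adobe.com/xdp/\">"
                + "<xfa:datasets xmlns:xfa=\"" + XFA_DATA_NS + "\">"
                + "<xfa:data><form1><name>Alice</name></form1></xfa:data>"
                + "</xfa:datasets></xdp:xdp>";
        System.out.println(extractDataXml(xdp));
    }
}
```

Whether getElementsByTagNameNS works on the Document returned by iText depends on how that DOM was parsed, so treat this as a starting point rather than a drop-in solution.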

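Be careful when replacing the XFA XML, though. It is best not to interfere with the XFA structure, but instead to create an XML file that contains only the data, built according to the appropriate schema, and to fill out the form as described in the FAQ: "How to fill out a pdf file programmatically? (Dynamic XFA)"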
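
To illustrate the "data only" idea, here is a minimal sketch, again using only the JDK, that changes one value in a data-only XML document and serializes the result. The class name XfaDataEditor, the element names form1 and name, and the values are all made up; a real form dictates its own schema. Handing the resulting XML to iText is the step the FAQ describes and is not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; not part of iText.
class XfaDataEditor {

    // Sets the text of the first element called `field` and returns the new XML.
    static String setField(String dataXml, String field, String value) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(dataXml)));
            NodeList hits = doc.getElementsByTagName(field);
            if (hits.getLength() == 0) {
                throw new IllegalArgumentException("No such field: " + field);
            }
            // Replaces all children of the element with a single text node.
            hits.item(0).setTextContent(value);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException
                | IOException | TransformerException e) {
            throw new IllegalArgumentException("Could not process data XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up data document; only the data changes, the form template stays untouched.
        String data = "<form1><name>Alice</name></form1>";
        System.out.println(setField(data, "name", "Bob"));
    }
}
```

Working on the data document alone keeps the template, config and other XFA packets out of reach, which is the point of the advice above.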

Solution 1
3 Accepted 2017-12-12 21:38:49