How can I create an accessible PDF with the Java PDFBox 2.0.8 library that is also verifiable with the PAC 2 tool?

Background

I have a small project on GitHub in which I am trying to create a Section 508 compliant (section508.gov) PDF that has form elements within a complex table structure. The tool recommended for verifying these PDFs is at http://www.access-for-all.ch/en/pdf-lab/pdf-accessibility-checker-pac.html and my program's output PDF passes most of its checks. I will also know what every field is meant for at runtime, so adding tags to structure elements should not be an issue.

The Problem

The PAC 2 tool seems to have an issue with two particular items in the output PDF. In particular, my radio buttons' widget annotations are not nested inside a Form structure element, and my marked content (text and table cells) is not tagged. PAC 2 verifies the P structure element that is within the top-left cell but not the marked content ...

However, PAC 2 does identify the marked content as an error (i.e. Text/Path object not tagged). Also, the radio button widgets are detected, but there seem to be no APIs to add them to a Form structure element.

What I Have Tried

I have looked at several questions on this website and others on the subject, including this one, Tagged PDF with PDFBox, but it seems that there are almost no examples for PDF/UA and very little useful documentation (that I have found). The most useful tips that I have found have been at sites that explain specs for tagged PDFs, like https://taggedpdf.com/508-pdf-help-center/object-not-tagged/ .

The Question

Is it possible to create a PAC 2 verifiable PDF with Apache PDFBox that includes marked content and radio button widget annotations? If it is possible, is it doable using higher-level (non-deprecated) PDFBox APIs?

Side Note: This is actually my first StackExchange question (although I have used the site extensively) and I hope everything is in order! Feel free to add any necessary edits and ask any questions that I may need to clarify. Also, I have an example program on GitHub which generates my PDF document at https://github.com/chris271/UAPDFBox .

Edit 1: Direct link to Output PDF Document

Edit 2: After using some of the lower-level PDFBox APIs and viewing raw data streams for fully compliant PDFs with PDFDebugger, I was able to generate a PDF with a content structure nearly identical to that of the compliant PDF ... However, the same errors appear, saying that the text objects are not tagged, and I really can't decide where to go from here ... Any guidance would be greatly appreciated!

Edit 3: Side-by-side raw PDF content comparison.

Edit 4: Internal structure of the generated PDF

[Image: generated PDF]

and the compliant PDF

[Image: compliant PDF]

Edit 5: I have managed to fix the PAC 2 errors for tagged path/text objects, thanks in part to suggestions from Tilman Hausherr! I will add an answer if I manage to fix the issues regarding 'annotation widgets not being nested inside form structure elements'.

After going through a large amount of the PDF spec and many PDFBox examples, I was able to fix all issues reported by PAC 2. There were several steps involved in creating the verified PDF (with a complex table structure), and the full source code is available on GitHub. I will attempt an overview of the major portions of the code below. (Some method calls will not be explained here!)

Step 1 (Setup metadata)

Various setup info like document title and language

//Setup new document
pdf = new PDDocument();
acroForm = new PDAcroForm(pdf);
pdf.getDocumentInformation().setTitle(title);
//Adjust other document metadata
PDDocumentCatalog documentCatalog = pdf.getDocumentCatalog();
//The Lang entry takes a language tag such as "en-US", not a language name
documentCatalog.setLanguage("en-US");
documentCatalog.setViewerPreferences(new PDViewerPreferences(new COSDictionary()));
documentCatalog.getViewerPreferences().setDisplayDocTitle(true);
documentCatalog.setAcroForm(acroForm);
documentCatalog.setStructureTreeRoot(structureTreeRoot);
//Flag the document as a tagged PDF
PDMarkInfo markInfo = new PDMarkInfo();
markInfo.setMarked(true);
documentCatalog.setMarkInfo(markInfo);
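The document language is expected to be a language tag (for example en-US) rather than a language name. As a minimal sketch, the tag could be derived from a java.util.Locale instead of being typed by hand; the class and method names here (LangTagSketch, langFor) are made up for illustration and are not part of PDFBox.

```java
import java.util.Locale;

// Hypothetical helper (not a PDFBox API): derives a language tag from a Locale.
final class LangTagSketch {

    // Returns a BCP 47 tag such as "en-US" that could be passed to setLanguage(...).
    static String langFor(Locale locale) {
        return locale.toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(langFor(Locale.US));
    }
}
```

The result would then be used as documentCatalog.setLanguage(LangTagSketch.langFor(Locale.US)).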

Embed all fonts directly into resources.

//Set AcroForm Appearance Characteristics
PDResources resources = new PDResources();
defaultFont = PDType0Font.load(pdf,
        new PDTrueTypeFont(PDType1Font.HELVETICA.getCOSObject()).getTrueTypeFont(), true);
resources.put(COSName.getPDFName("Helv"), defaultFont);
acroForm.setNeedAppearances(true);
acroForm.setXFA(null);
acroForm.setDefaultResources(resources);
acroForm.setDefaultAppearance(DEFAULT_APPEARANCE);

Add XMP Metadata for PDF/UA spec.

//Add UA XMP metadata based on specs at https://taggedpdf.com/508-pdf-help-center/pdfua-identifier-missing/
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
xmp.createAndAddDublinCoreSchema();
xmp.getDublinCoreSchema().setTitle(title);
xmp.getDublinCoreSchema().setDescription(title);
//Describe the pdfuaid schema through the PDF/A extension schema
xmp.createAndAddPDFAExtensionSchemaWithDefaultNS();
xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfa/ns/schema#", "pdfaSchema");
xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfa/ns/property#", "pdfaProperty");
xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfua/ns/id/", "pdfuaid");
XMPSchema uaSchema = new XMPSchema(XMPMetadata.createXMPMetadata(),
        "pdfaSchema", "pdfaSchema", "pdfaSchema");
uaSchema.setTextPropertyValue("schema", "PDF/UA Universal Accessibility Schema");
uaSchema.setTextPropertyValue("namespaceURI", "http://www.aiim.org/pdfua/ns/id/");
uaSchema.setTextPropertyValue("prefix", "pdfuaid");
XMPSchema uaProp = new XMPSchema(XMPMetadata.createXMPMetadata(),
        "pdfaProperty", "pdfaProperty", "pdfaProperty");
uaProp.setTextPropertyValue("name", "part");
uaProp.setTextPropertyValue("valueType", "Integer");
uaProp.setTextPropertyValue("category", "internal");
uaProp.setTextPropertyValue("description", "Indicates, which part of ISO 14289 standard is followed");
uaSchema.addUnqualifiedSequenceValue("property", uaProp);
xmp.getPDFExtensionSchema().addBagValue("schemas", uaSchema);
//Write the PDF/UA identifier itself (pdfuaid:part = 1)
xmp.getPDFExtensionSchema().setPrefix("pdfuaid");
xmp.getPDFExtensionSchema().setTextPropertyValue("part", "1");
//Serialize the XMP packet and attach it to the document catalog
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(pdf);
metadata.importXMPMetadata(baos.toByteArray());
pdf.getDocumentCatalog().setMetadata(metadata);
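To make the goal of the XMP setup more concrete, here is a rough sketch that writes a minimal packet by hand, containing only the title and the pdfuaid:part identifier. It uses plain JDK string handling instead of xmpbox, the class PdfUaXmpSketch is hypothetical, and it omits the PDF/A extension schema description that the code above also emits, so treat it as an illustration of the target output rather than a drop-in replacement.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of PDFBox or xmpbox): builds a minimal XMP packet by hand.
final class PdfUaXmpSketch {

    // Namespace URI of the PDF/UA identification schema.
    static final String PDFUAID_NS = "http://www.aiim.org/pdfua/ns/id/";

    // Escapes the characters that may not appear verbatim in XML text.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Returns an XMP packet holding dc:title and pdfuaid:part = 1.
    static String buildPacket(String title) {
        return "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n"
             + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
             + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
             + "  <rdf:Description rdf:about=\"\"\n"
             + "      xmlns:dc=\"http://purl.org/dc/elements/1.1/\"\n"
             + "      xmlns:pdfuaid=\"" + PDFUAID_NS + "\">\n"
             + "   <dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">"
             + escapeXml(title) + "</rdf:li></rdf:Alt></dc:title>\n"
             + "   <pdfuaid:part>1</pdfuaid:part>\n"
             + "  </rdf:Description>\n"
             + " </rdf:RDF>\n"
             + "</x:xmpmeta>\n"
             + "<?xpacket end=\"w\"?>";
    }

    public static void main(String[] args) {
        byte[] bytes = buildPacket("Example & Title").getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The resulting bytes could presumably be handed to metadata.importXMPMetadata(...) in the same way as the serialized xmpbox output above; whether PAC 2 accepts the packet without the extension schema description has not been checked here.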

Step 2 (Setup document tag structure)

You will need to add the root structure element to the structure tree root, and then add all other necessary structure elements as children of that root element.

//Adds a DOCUMENT structure element as the structure tree root.
void addRoot() {
    PDStructureElement root = new PDStructureElement(StandardStructureTypes.DOCUMENT, null);
    root.setAlternateDescription("The document's root structure element.");
    root.setTitle("PDF Document");
    pdf.getDocumentCatalog().getStructureTreeRoot().appendKid(root);
    currentElem = root;
    rootElem = root;
}

Each marked content element (text and background graphics) will need an MCID and an associated tag so that it can be referenced from the parent tree, which is explained in step 3.

//Assign an id for the next marked content element.
private void setNextMarkedContentDictionary(String tag) {
    currentMarkedContentDictionary = new COSDictionary();
    currentMarkedContentDictionary.setName("Tag", tag);
    currentMarkedContentDictionary.setInt(COSName.MCID, currentMCID);
    currentMCID++;
}

Artifacts (background graphics) will not be detected by the screen reader. Text needs to be detectable, so a P structure element is used here when adding text.

//Set up the next marked content element with an MCID and create the containing TD structure element.
PDPageContentStream contents = new PDPageContentStream(
        pdf, pages.get(pageIndex), PDPageContentStream.AppendMode.APPEND, false);
currentElem = addContentToParent(null, StandardStructureTypes.TD, pages.get(pageIndex), currentRow);

//Make the actual cell rectangle and set as artifact to avoid detection.
setNextMarkedContentDictionary(COSName.ARTIFACT.getName());
contents.beginMarkedContent(COSName.ARTIFACT, PDPropertyList.create(currentMarkedContentDictionary));

//Draws the cell itself with the given colors and location.
drawDataCell(table.getCell(i, j).getCellColor(), table.getCell(i, j).getBorderColor(),
        x + table.getRows().get(i).getCellPosition(j),
        y + table.getRowPosition(i),
        table.getCell(i, j).getWidth(), table.getRows().get(i).getHeight(), contents);
contents.endMarkedContent();
currentElem = addContentToParent(COSName.ARTIFACT, StandardStructureTypes.P, pages.get(pageIndex), currentElem);
contents.close();

//Draw the cell's text as a P structure element
contents = new PDPageContentStream(
        pdf, pages.get(pageIndex), PDPageContentStream.AppendMode.APPEND, false);
setNextMarkedContentDictionary(COSName.P.getName());
contents.beginMarkedContent(COSName.P, PDPropertyList.create(currentMarkedContentDictionary));
//... Code to draw actual text...//
//End the marked content and append its P structure element to the containing TD structure element.
contents.endMarkedContent();
addContentToParent(COSName.P, null, pages.get(pageIndex), currentElem);
contents.close();
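It can help to see what the calls above boil down to in the raw page content stream, since that is what PDFDebugger shows. The following is a minimal sketch in plain Java (no PDFBox) that prints the operator sequence for one cell: the background wrapped as an artifact, and the text wrapped in a P marked-content sequence carrying an MCID. The class MarkedContentSketch, the coordinates, and the font name /Helv are illustrative assumptions, and the artifact here uses the property-less BMC form, whereas the code above attaches a property dictionary to it.

```java
// Hypothetical helper (not a PDFBox API): prints the operators a tagged cell boils down to.
final class MarkedContentSketch {

    // Escapes backslashes and parentheses for a PDF literal string.
    static String escapeLiteral(String text) {
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    // Tagged content: /Tag <</MCID n>> BDC ... EMC
    static String tagged(String tag, int mcid, String operators) {
        return "/" + tag + " <</MCID " + mcid + ">> BDC\n" + operators + "\nEMC\n";
    }

    // Artifact content, which assistive technology skips: /Artifact BMC ... EMC
    static String artifact(String operators) {
        return "/Artifact BMC\n" + operators + "\nEMC\n";
    }

    // One table cell: grey background rectangle as artifact, then the text as tagged P content.
    // Simplified: a real stream encodes the text for the font actually in use.
    static String cell(int mcid, String text) {
        String background = "0.9 g\n100 700 200 20 re\nf";
        String label = "BT\n/Helv 12 Tf\n105 706 Td\n(" + escapeLiteral(text) + ") Tj\nET";
        return artifact(background) + tagged("P", mcid, label);
    }

    public static void main(String[] args) {
        System.out.print(cell(0, "Yes (default)"));
    }
}
```

A "Text/Path object not tagged" report typically means that some drawing operators sit outside every BDC/BMC ... EMC pair, or that an MCID in the stream is not reachable from the structure tree.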

Annotation widgets (form objects in this case) will need to be nested within Form structure elements.

//Add a radio button widget.
if (!table.getCell(i, j).getRbVal().isEmpty()) {
    PDStructureElement fieldElem = new PDStructureElement(StandardStructureTypes.FORM, currentElem);
    radioWidgets.add(addRadioButton(
            x + table.getRows().get(i).getCellPosition(j) -
                    radioWidgets.size() * 10 + table.getCell(i, j).getWidth() / 4,
            y + table.getRowPosition(i),
            table.getCell(i, j).getWidth() * 1.5f, 20,
            radioValues, pageIndex, radioWidgets.size()));
    fieldElem.setPage(pages.get(pageIndex));
    COSArray kArray = new COSArray();
    kArray.add(COSInteger.get(currentMCID));
    fieldElem.getCOSObject().setItem(COSName.K, kArray);
    addWidgetContent(annotationRefs.get(annotationRefs.size() - 1), fieldElem, StandardStructureTypes.FORM, pageIndex);
}

//Add a text field in the current cell.
if (!table.getCell(i, j).getTextVal().isEmpty()) {
    PDStructureElement fieldElem = new PDStructureElement(StandardStructureTypes.FORM, currentElem);
    addTextField(x + table.getRows().get(i).getCellPosition(j),
            y + table.getRowPosition(i),
            table.getCell(i, j).getWidth(), table.getRows().get(i).getHeight(),
            table.getCell(i, j).getTextVal(), pageIndex);
    fieldElem.setPage(pages.get(pageIndex));
    COSArray kArray = new COSArray();
    kArray.add(COSInteger.get(currentMCID));
    fieldElem.getCOSObject().setItem(COSName.K, kArray);
    addWidgetContent(annotationRefs.get(annotationRefs.size() - 1), fieldElem, StandardStructureTypes.FORM, pageIndex);
}

Step 3

After all content elements have been written to the content stream and the tag structure has been set up, it is necessary to go back and add the parent tree to the structure tree root. Note: Some method calls (addWidgetContent() and addContentToParent()) in the above code set up the necessary COSDictionary objects.

//Adds the parent tree to root struct element to identify tagged content
void addParentTree() {
    COSDictionary dict = new COSDictionary();
    nums.add(numDictionaries);
    for (int i = 1; i < currentStructParent; i++) {
        nums.add(COSInteger.get(i));
        nums.add(annotDicts.get(i - 1));
    }
    dict.setItem(COSName.NUMS, nums);
    PDNumberTreeNode numberTreeNode = new PDNumberTreeNode(dict, dict.getClass());
    pdf.getDocumentCatalog().getStructureTreeRoot().setParentTreeNextKey(currentStructParent);
    pdf.getDocumentCatalog().getStructureTreeRoot().setParentTree(numberTreeNode);
}
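The layout of the parent tree is easy to get wrong, so here is a small model of it in plain Java, under the assumption that it follows the number-tree rules of the PDF spec: a page's StructParents key maps to an array with one structure element per MCID (in MCID order), an annotation's StructParent key maps to a single structure element, and the flattened Nums array alternates keys and values in ascending key order. The class ParentTreeSketch and the string labels standing in for structure elements are hypothetical and only illustrate the shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model (not a PDFBox API) of the Nums array inside a parent tree.
final class ParentTreeSketch {

    // Sorted by key, as a number tree requires.
    private final TreeMap<Integer, Object> entries = new TreeMap<>();

    // Page entry: index i of the list is the structure element owning MCID i on that page.
    void putPage(int structParents, List<String> elementPerMcid) {
        entries.put(structParents, new ArrayList<>(elementPerMcid));
    }

    // Annotation entry: the single structure element (e.g. a Form) that owns the widget.
    void putAnnotation(int structParent, String element) {
        entries.put(structParent, element);
    }

    // Flattens to the Nums layout: [key0, value0, key1, value1, ...].
    List<Object> toNums() {
        List<Object> nums = new ArrayList<>();
        for (Map.Entry<Integer, Object> e : entries.entrySet()) {
            nums.add(e.getKey());
            nums.add(e.getValue());
        }
        return nums;
    }

    // ParentTreeNextKey has to be greater than every key already in use.
    int nextKey() {
        return entries.isEmpty() ? 0 : entries.lastKey() + 1;
    }

    public static void main(String[] args) {
        ParentTreeSketch tree = new ParentTreeSketch();
        tree.putPage(0, List.of("P for MCID 0", "P for MCID 1"));
        tree.putAnnotation(1, "Form for radio widget");
        System.out.println(tree.toNums() + " next=" + tree.nextKey());
    }
}
```

In the real document each label would be an indirect reference to a structure element dictionary, and the key would match the StructParents entry of the page or the StructParent entry of the widget annotation.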

If all widget annotations and marked content were added correctly to the structure tree and parent tree, then you should get something like this from PAC 2 and PDFDebugger.

[Image: verified PDF in PAC 2]

[Image: PDFDebugger view]

Thank you to Tilman Hausherr for pointing me in the right direction to solve this! I will most likely make some edits to this answer for additional clarity as recommended by others.

Edit 1:

If you want to have a table structure like the one I have generated, you will also need to add correct table markup to fully comply with the 508 standard ... The 'Scope', 'ColSpan', 'RowSpan', or 'Headers' attributes will need to be correctly added to each table cell structure element, similar to this or this. The main purpose of this markup is to allow screen reading software like JAWS to read the table content in an understandable way. These attributes can be added in a way similar to the code below ...

private void addTableCellMarkup(Cell cell, int pageIndex, PDStructureElement currentRow) {
    COSDictionary cellAttr = new COSDictionary();
    cellAttr.setName(COSName.O, "Table");
    if (cell.getCellMarkup().isHeader()) {
        currentElem = addContentToParent(null, StandardStructureTypes.TH, pages.get(pageIndex), currentRow);
        currentElem.getCOSObject().setString(COSName.ID, cell.getCellMarkup().getId());
        if (cell.getCellMarkup().getScope().length() > 0) {
            cellAttr.setName(COSName.getPDFName("Scope"), cell.getCellMarkup().getScope());
        }
        if (cell.getCellMarkup().getColspan() > 1) {
            cellAttr.setInt(COSName.getPDFName("ColSpan"), cell.getCellMarkup().getColspan());
        }
        if (cell.getCellMarkup().getRowSpan() > 1) {
            cellAttr.setInt(COSName.getPDFName("RowSpan"), cell.getCellMarkup().getRowSpan());
        }
    } else {
        currentElem = addContentToParent(null, StandardStructureTypes.TD, pages.get(pageIndex), currentRow);
    }
    if (cell.getCellMarkup().getHeaders().length > 0) {
        COSArray headerA = new COSArray();
        for (String s : cell.getCellMarkup().getHeaders()) {
            headerA.add(new COSString(s));
        }
        cellAttr.setItem(COSName.getPDFName("Headers"), headerA);
    }
    currentElem.getCOSObject().setItem(COSName.A, cellAttr);
}
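The 'Headers' attribute lists the IDs of the header cells that apply to a given cell. As a rough sketch of how those IDs might be collected for a data cell, the following plain-Java example picks every column-scoped header above the cell that covers its column and every row-scoped header to its left that covers its row. The class HeaderLookupSketch, its HeaderCell type, and this selection rule are assumptions made for illustration; they are not taken from the GitHub project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a PDFBox API): collects header IDs for a data cell.
final class HeaderLookupSketch {

    // A header cell with its grid position, spans, and scope.
    static final class HeaderCell {
        final String id;
        final int row, col, rowSpan, colSpan;
        final boolean columnScope; // true for Scope "Column", false for Scope "Row"

        HeaderCell(String id, int row, int col, int rowSpan, int colSpan, boolean columnScope) {
            this.id = id;
            this.row = row;
            this.col = col;
            this.rowSpan = rowSpan;
            this.colSpan = colSpan;
            this.columnScope = columnScope;
        }
    }

    // Returns the IDs of all headers that apply to the data cell at (row, col).
    static List<String> headersFor(List<HeaderCell> headers, int row, int col) {
        List<String> ids = new ArrayList<>();
        for (HeaderCell h : headers) {
            boolean coversCol = col >= h.col && col < h.col + h.colSpan;
            boolean coversRow = row >= h.row && row < h.row + h.rowSpan;
            if (h.columnScope && coversCol && h.row + h.rowSpan <= row) {
                ids.add(h.id); // column header located above the cell
            } else if (!h.columnScope && coversRow && h.col + h.colSpan <= col) {
                ids.add(h.id); // row header located to the left of the cell
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<HeaderCell> headers = new ArrayList<>();
        headers.add(new HeaderCell("h-question", 0, 0, 1, 1, true));
        headers.add(new HeaderCell("h-answer", 0, 1, 1, 2, true));
        headers.add(new HeaderCell("r-q1", 1, 0, 1, 1, false));
        System.out.println(headersFor(headers, 1, 2));
    }
}
```

The resulting list corresponds to what cell.getCellMarkup().getHeaders() supplies in the method above, and each ID has to match the ID set on the corresponding TH structure element.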

Be sure to do something like currentElem.setAlternateDescription(currentCell.getText()); on each of the structure elements that contain text marked content, so that JAWS can read the text.

Note: Each of the fields (radio button and textbox) will need a unique name to avoid setting multiple field values at once. GitHub has been updated with a more complex example PDF with table markup and improved form fields!
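One simple way to keep field names unique is to route every name through a small registry before creating the field. The sketch below is plain Java with a made-up class name (FieldNameSketch); it also replaces periods, since a period separates the levels of a fully qualified field name and may not appear inside a partial field name.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not a PDFBox API): hands out unique partial field names.
final class FieldNameSketch {

    private final Set<String> used = new HashSet<>();

    // Returns base itself the first time, then base_2, base_3, ... on repeats.
    String unique(String base) {
        String clean = base.replace('.', '_'); // periods are reserved as level separators
        String candidate = clean;
        int suffix = 2;
        while (!used.add(candidate)) {
            candidate = clean + "_" + suffix++;
        }
        return candidate;
    }

    public static void main(String[] args) {
        FieldNameSketch names = new FieldNameSketch();
        System.out.println(names.unique("answer"));
        System.out.println(names.unique("answer"));
    }
}
```

The returned name would then be the one passed to the field when it is created, for example in addTextField(...) above.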
