How can I create an accessible PDF with the Java PDFBox 2.0.8 library that can also be validated with the PAC 2 tool?

Question

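Background

I have a small project on GitHub in which I am trying to create a Section 508 compliant (section508.gov) PDF that has form elements inside a complex table structure. The tool recommended for validating these PDFs is at http://www.access-for-all.ch/en/pdf-lab/pdf-accessibility-checker-pac.html , and my program's output PDF does pass most of its checks. I will also know what each field means at runtime, so adding tags to the structure elements should not be a problem.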

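The problem

The PAC 2 tool appears to have a problem with two specific items in the output PDF. In particular, the widget annotations of my radio buttons are not nested inside a Form structure element, and my marked content (text and table cells) is not tagged. PAC 2 validates the P structure element inside the top-left cell, but not the marked content ...

However, PAC 2 does identify the marked content as an error (i.e. untagged text/path objects). Also, the radio button widgets are detected, but there appears to be no API for adding them to a Form structure element.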

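What I have tried

I have looked at several questions on this site and elsewhere on this topic, including this one, Tagged PDF with PDFBox, but there seem to be almost no examples for PDF/UA and very little useful documentation (that I have found). The most useful hints I have found are on sites that explain the specifications for tagged PDFs, such as https://taggedpdf.com/508-pdf-help-center/object-not-tagged/ .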

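The question

Is it possible to create a PAC 2 verifiable PDF with Apache PDFBox that contains marked content and radio button widget annotations? If so, can it be done with the higher-level (non-deprecated) PDFBox APIs?

Side note: this is actually my first StackExchange question (although I have used the site extensively), and I hope everything is in order! Feel free to make any necessary edits and to ask about anything I may need to clarify. Also, I have an example program on GitHub that generates my PDF document at https://github.com/chris271/UAPDFBox .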

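Edit 1: Direct link to the output PDF document

Edit 2: After using some of the lower-level PDFBox APIs and viewing the raw data streams of a fully compliant PDF with PDFDebugger, I was able to generate a PDF whose content structure is nearly identical to that of the compliant PDF ... However, the same error appears, saying that the text objects are not tagged, and I really can't decide where to go from here ... Any guidance would be greatly appreciated!

Edit 3: Side-by-side comparison of the raw PDF content.

Edit 4: Internal structure of the generated PDF

and of the compliant PDF

Edit 5: I have managed to fix the PAC 2 errors for tagged path/text objects, thanks in part to suggestions from Tilman Hausherr! I will add an answer if I manage to fix the issue of "annotation widgets not being nested inside a Form structure element".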

Answer 1

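After going through a large amount of the PDF specification and many PDFBox examples, I was able to fix all of the issues reported by PAC 2. Creating the validated PDF (with a complex table structure) involved several steps, and the full source code is available here on GitHub. I will try to give an overview of the major parts of the code below. (Some method calls are not explained here!)

Step 1 (Set up metadata)

Various setup information, such as the document title and language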

//Setup new document
    pdf = new PDDocument();
    acroForm = new PDAcroForm(pdf);
    pdf.getDocumentInformation().setTitle(title);
    //Adjust other document metadata
    PDDocumentCatalog documentCatalog = pdf.getDocumentCatalog();
    //The /Lang entry should be a language tag such as "en-US", not a language name.
    documentCatalog.setLanguage("en-US");
    documentCatalog.setViewerPreferences(new PDViewerPreferences(new COSDictionary()));
    documentCatalog.getViewerPreferences().setDisplayDocTitle(true);
    documentCatalog.setAcroForm(acroForm);
    documentCatalog.setStructureTreeRoot(structureTreeRoot);
    PDMarkInfo markInfo = new PDMarkInfo();
    markInfo.setMarked(true);
    documentCatalog.setMarkInfo(markInfo);

Embed all fonts directly in the resources.

//Set AcroForm Appearance Characteristics
    PDResources resources = new PDResources();
    defaultFont = PDType0Font.load(pdf,
            new PDTrueTypeFont(PDType1Font.HELVETICA.getCOSObject()).getTrueTypeFont(), true);
    resources.put(COSName.getPDFName("Helv"), defaultFont);
    acroForm.setNeedAppearances(true);
    acroForm.setXFA(null);
    acroForm.setDefaultResources(resources);
    acroForm.setDefaultAppearance(DEFAULT_APPEARANCE);

Add the XMP metadata required by the PDF/UA specification.

//Add UA XMP metadata based on specs at https://taggedpdf.com/508-pdf-help-center/pdfua-identifier-missing/
    XMPMetadata xmp = XMPMetadata.createXMPMetadata();
    xmp.createAndAddDublinCoreSchema();
    xmp.getDublinCoreSchema().setTitle(title);
    xmp.getDublinCoreSchema().setDescription(title);
    xmp.createAndAddPDFAExtensionSchemaWithDefaultNS();
    xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfa/ns/schema#", "pdfaSchema");
    xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfa/ns/property#", "pdfaProperty");
    xmp.getPDFExtensionSchema().addNamespace("http://www.aiim.org/pdfua/ns/id/", "pdfuaid");
    XMPSchema uaSchema = new XMPSchema(XMPMetadata.createXMPMetadata(),
            "pdfaSchema", "pdfaSchema", "pdfaSchema");
    uaSchema.setTextPropertyValue("schema", "PDF/UA Universal Accessibility Schema");
    uaSchema.setTextPropertyValue("namespaceURI", "http://www.aiim.org/pdfua/ns/id/");
    uaSchema.setTextPropertyValue("prefix", "pdfuaid");
    XMPSchema uaProp = new XMPSchema(XMPMetadata.createXMPMetadata(),
            "pdfaProperty", "pdfaProperty", "pdfaProperty");
    uaProp.setTextPropertyValue("name", "part");
    uaProp.setTextPropertyValue("valueType", "Integer");
    uaProp.setTextPropertyValue("category", "internal");
    uaProp.setTextPropertyValue("description", "Indicates, which part of ISO 14289 standard is followed");
    uaSchema.addUnqualifiedSequenceValue("property", uaProp);
    xmp.getPDFExtensionSchema().addBagValue("schemas", uaSchema);
    xmp.getPDFExtensionSchema().setPrefix("pdfuaid");
    xmp.getPDFExtensionSchema().setTextPropertyValue("part", "1");
    XmpSerializer serializer = new XmpSerializer();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    serializer.serialize(xmp, baos, true);
    PDMetadata metadata = new PDMetadata(pdf);
    metadata.importXMPMetadata(baos.toByteArray());
    pdf.getDocumentCatalog().setMetadata(metadata);
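
For orientation, the property that identifies a file as PDF/UA is pdfuaid:part, in the http://www.aiim.org/pdfua/ns/id/ namespace (the same namespace URI and "part" value used in the code above). The following is a minimal sketch in plain Java, with no PDFBox dependency, that only builds that rdf:Description fragment as a string. The class and method names are hypothetical, and this is not how the code above generates the packet; it is meant as something to compare against when inspecting the serialized XMP.

```java
// Hypothetical helper (not part of the linked project): builds only the
// PDF/UA identification fragment that is expected inside the XMP packet.
final class PdfUaIdFragment {
    // Namespace URI taken from the addNamespace(...) call in the code above.
    static final String PDFUAID_NS = "http://www.aiim.org/pdfua/ns/id/";

    // Returns an rdf:Description element declaring pdfuaid:part.
    static String build(int part) {
        if (part < 1) {
            throw new IllegalArgumentException("part must be positive");
        }
        return "<rdf:Description rdf:about=\"\" xmlns:pdfuaid=\"" + PDFUAID_NS + "\">\n"
             + "  <pdfuaid:part>" + part + "</pdfuaid:part>\n"
             + "</rdf:Description>";
    }
}
```

Calling PdfUaIdFragment.build(1) gives the fragment for PDF/UA-1; searching the serialized packet for the pdfuaid:part element is a quick sanity check before running PAC 2.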

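Step 2 (Set up the document tag structure)

You need to add the root structure element, and then add all of the necessary structure elements as children of that root element.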

//Adds a DOCUMENT structure element as the structure tree root.
void addRoot() {
    PDStructureElement root = new PDStructureElement(StandardStructureTypes.DOCUMENT, null);
    root.setAlternateDescription("The document's root structure element.");
    root.setTitle("PDF Document");
    pdf.getDocumentCatalog().getStructureTreeRoot().appendKid(root);
    currentElem = root;
    rootElem = root;
}

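Every marked content element (text and background graphics) needs an MCID and an associated tag so that it can be referenced in the parent tree, which is explained in step 3.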

//Assign an id for the next marked content element.
private void setNextMarkedContentDictionary(String tag) {
    currentMarkedContentDictionary = new COSDictionary();
    currentMarkedContentDictionary.setName("Tag", tag);
    currentMarkedContentDictionary.setInt(COSName.MCID, currentMCID);
    currentMCID++;
}

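Artifacts (background graphics) are not detected by screen readers. Text needs to be detectable, so a P structure element is used when adding text.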

            //Set up the next marked content element with an MCID and create the containing TD structure element.
            PDPageContentStream contents = new PDPageContentStream(
                    pdf, pages.get(pageIndex), PDPageContentStream.AppendMode.APPEND, false);
            currentElem = addContentToParent(null, StandardStructureTypes.TD, pages.get(pageIndex), currentRow);

            //Make the actual cell rectangle and set as artifact to avoid detection.
            setNextMarkedContentDictionary(COSName.ARTIFACT.getName());
            contents.beginMarkedContent(COSName.ARTIFACT, PDPropertyList.create(currentMarkedContentDictionary));

            //Draws the cell itself with the given colors and location.
            drawDataCell(table.getCell(i, j).getCellColor(), table.getCell(i, j).getBorderColor(),
                    x + table.getRows().get(i).getCellPosition(j),
                    y + table.getRowPosition(i),
                    table.getCell(i, j).getWidth(), table.getRows().get(i).getHeight(), contents);
            contents.endMarkedContent();
            currentElem = addContentToParent(COSName.ARTIFACT, StandardStructureTypes.P, pages.get(pageIndex), currentElem);
            contents.close();
            //Draw the cell's text as a P structure element
            contents = new PDPageContentStream(
                    pdf, pages.get(pageIndex), PDPageContentStream.AppendMode.APPEND, false);
            setNextMarkedContentDictionary(COSName.P.getName());
            contents.beginMarkedContent(COSName.P, PDPropertyList.create(currentMarkedContentDictionary));
            //... Code to draw actual text...//
            //End the marked content and append its P structure element to the containing TD structure element.
            contents.endMarkedContent();
            addContentToParent(COSName.P, null, pages.get(pageIndex), currentElem);
            contents.close();

Annotation widgets (form objects in this case) need to be nested inside a Form structure element.

//Add a radio button widget.
            if (!table.getCell(i, j).getRbVal().isEmpty()) {
                PDStructureElement fieldElem = new PDStructureElement(StandardStructureTypes.FORM, currentElem);
                radioWidgets.add(addRadioButton(
                        x + table.getRows().get(i).getCellPosition(j) -
                                radioWidgets.size() * 10 + table.getCell(i, j).getWidth() / 4,
                        y + table.getRowPosition(i),
                        table.getCell(i, j).getWidth() * 1.5f, 20,
                        radioValues, pageIndex, radioWidgets.size()));
                fieldElem.setPage(pages.get(pageIndex));
                COSArray kArray = new COSArray();
                kArray.add(COSInteger.get(currentMCID));
                fieldElem.getCOSObject().setItem(COSName.K, kArray);
                addWidgetContent(annotationRefs.get(annotationRefs.size() - 1), fieldElem, StandardStructureTypes.FORM, pageIndex);
            }

//Add a text field in the current cell.
            if (!table.getCell(i, j).getTextVal().isEmpty()) {
                PDStructureElement fieldElem = new PDStructureElement(StandardStructureTypes.FORM, currentElem);
                addTextField(x + table.getRows().get(i).getCellPosition(j),
                        y + table.getRowPosition(i),
                        table.getCell(i, j).getWidth(), table.getRows().get(i).getHeight(),
                        table.getCell(i, j).getTextVal(), pageIndex);
                fieldElem.setPage(pages.get(pageIndex));
                COSArray kArray = new COSArray();
                kArray.add(COSInteger.get(currentMCID));
                fieldElem.getCOSObject().setItem(COSName.K, kArray);
                addWidgetContent(annotationRefs.get(annotationRefs.size() - 1), fieldElem, StandardStructureTypes.FORM, pageIndex);
            }

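Step 3

After all of the content elements have been written to the content stream and the tag structure has been set up, you must go back and add the parent tree to the structure tree root. Note: some method calls in the code above (addWidgetContent() and addContentToParent()) set up the necessary COSDictionary objects.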

//Adds the parent tree to root struct element to identify tagged content
void addParentTree() {
    COSDictionary dict = new COSDictionary();
    nums.add(numDictionaries);
    for (int i = 1; i < currentStructParent; i++) {
        nums.add(COSInteger.get(i));
        nums.add(annotDicts.get(i - 1));
    }
    dict.setItem(COSName.NUMS, nums);
    PDNumberTreeNode numberTreeNode = new PDNumberTreeNode(dict, dict.getClass());
    pdf.getDocumentCatalog().getStructureTreeRoot().setParentTreeNextKey(currentStructParent);
    pdf.getDocumentCatalog().getStructureTreeRoot().setParentTree(numberTreeNode);
}
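
The layout that addParentTree() writes can be easier to follow as plain data. In the PDF specification, the Nums array of a number tree is a flat list of alternating keys and values. In the parent tree, the value stored under a page's StructParents key is an array of structure elements indexed by MCID, while the value stored under an annotation's StructParent key is the single structure element that contains it. Below is a sketch in plain Java (no PDFBox types; strings stand in for structure element references, and all names are hypothetical) that builds such a list, assuming key 0 belongs to the page and keys 1..n belong to the widgets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the parent tree's Nums array; not PDFBox code.
final class ParentTreeLayout {

    // pageElemsByMcid: parent element of each marked content sequence, at index == MCID.
    // widgetParents: the Form element that contains each widget annotation, in key order.
    static List<Object> buildNums(List<String> pageElemsByMcid, List<String> widgetParents) {
        List<Object> nums = new ArrayList<>();
        // Assumption: key 0 is the page's StructParents value; its entry is an array.
        nums.add(0);
        nums.add(new ArrayList<>(pageElemsByMcid));
        // Keys 1..n are the widgets' StructParent values; each entry is a single element.
        for (int i = 0; i < widgetParents.size(); i++) {
            nums.add(i + 1);
            nums.add(widgetParents.get(i));
        }
        return nums;
    }

    // The next unused key, corresponding to what is passed to setParentTreeNextKey(...).
    static int nextKey(List<String> widgetParents) {
        return widgetParents.size() + 1;
    }
}
```

When a checker reports untagged content even though the structure tree looks right, comparing the real Nums array in PDFDebugger against this shape (array for the page, single element per widget) is one way to narrow the problem down.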

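If all of the widget annotations and marked content were added correctly to the structure tree and the parent tree, then you should get something like this from PAC 2 and PDFDebugger.

Thanks to Tilman Hausherr for pointing me in the right direction to solve this! I will most likely make some edits to this answer for added clarity, based on suggestions from others.

Edit 1:

If you want a table structure like the one I generated, you will also need to add the correct table markup to fully comply with the 508 standard ... The 'Scope', 'ColSpan', 'RowSpan', or 'Headers' attributes need to be correctly added to each table cell structure element, similar to this or this. The main purpose of this markup is to allow screen-reading software such as JAWS to read the table content in an understandable way. These attributes can be added in a manner similar to the following ...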

private void addTableCellMarkup(Cell cell, int pageIndex, PDStructureElement currentRow) {
    COSDictionary cellAttr = new COSDictionary();
    cellAttr.setName(COSName.O, "Table");
    if (cell.getCellMarkup().isHeader()) {
        currentElem = addContentToParent(null, StandardStructureTypes.TH, pages.get(pageIndex), currentRow);
        currentElem.getCOSObject().setString(COSName.ID, cell.getCellMarkup().getId());
        if (cell.getCellMarkup().getScope().length() > 0) {
            cellAttr.setName(COSName.getPDFName("Scope"), cell.getCellMarkup().getScope());
        }
        if (cell.getCellMarkup().getColspan() > 1) {
            cellAttr.setInt(COSName.getPDFName("ColSpan"), cell.getCellMarkup().getColspan());
        }
        if (cell.getCellMarkup().getRowSpan() > 1) {
            cellAttr.setInt(COSName.getPDFName("RowSpan"), cell.getCellMarkup().getRowSpan());
        }
    } else {
        currentElem = addContentToParent(null, StandardStructureTypes.TD, pages.get(pageIndex), currentRow);
    }
    if (cell.getCellMarkup().getHeaders().length > 0) {
        COSArray headerA = new COSArray();
        for (String s : cell.getCellMarkup().getHeaders()) {
            headerA.add(new COSString(s));
        }
        cellAttr.setItem(COSName.getPDFName("Headers"), headerA);
    }
    currentElem.getCOSObject().setItem(COSName.A, cellAttr);
}
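
Because each entry in a cell's Headers attribute is meant to match the ID of a TH structure element, a dangling ID is an easy mistake to make when header IDs are generated at runtime. The following plain-Java sketch (hypothetical names, no PDFBox dependency) shows one way to check for this before the attributes are written. It is an illustration only and not part of the linked project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical pre-flight check for table markup; not PDFBox code.
final class HeaderIdCheck {

    // Returns the entries of cellHeaders that do not match any known TH ID.
    static List<String> danglingHeaders(Set<String> thIds, List<String> cellHeaders) {
        List<String> missing = new ArrayList<>();
        for (String id : cellHeaders) {
            if (!thIds.contains(id)) {
                missing.add(id);
            }
        }
        return missing;
    }
}
```

An empty result means every header reference of the cell resolves to a TH element that was given that ID.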

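Be sure to call something like currentElem.setAlternateDescription(currentCell.getText()); on each structure element that has text marked content, so that JAWS can read the text.

Note: each field (radio buttons and text boxes) needs a unique name to avoid setting multiple field values at once. The GitHub repository has been updated with a more complex example PDF, with table markup and improved form fields!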

Solution 1
9 Accepted 2018-04-17 22:23:43