How to know if a field is on a particular page?

The PDFBox content stream is created per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages, and this is causing text to be written to incorrect locations/pages.

I.e., I'm processing fields per page, but I'm not sure which fields are on which pages.

Is there a way to tell which field is on which page? Or is there a way to get just the fields on the current page?

Thank you!

Mark

Code snippet:

PDDocument pdfDoc = PDDocument.load(file);
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();

// Get field names
List<PDField> fieldList = acroForm.getFields();
List<PDPage> pages = pdfDoc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
  PDPageContentStream contentStream = new PDPageContentStream(pdfDoc, page, true, true, true);
  processFields(acroForm, fieldList, contentStream, page);
  contentStream.close();
}

"The PDFbox content stream is done per page, but the fields come from the form which comes from the catalog, which comes from the pdf doc itself. So I'm not sure which fields are on which pages"

The reason for this is that PDFs contain a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations on 0, 1, or more actual PDF pages. Furthermore, if there is only 1 visualization, the field object and the visualization object may be merged into a single object.
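As a rough illustration of the merged versus separate case, here is a minimal sketch against the PDFBox 2.0.x API. The class and method names (FieldWidgetShape, describe) are made up for this example. Note one assumption about PDFBox behaviour: as far as I can tell, a terminal field without a /Kids array is always reported by getWidgets() as its own single widget, so "merged" here does not prove that any page actually displays the field.

```java
import java.util.List;

import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

// Hypothetical helper, not part of PDFBox.
class FieldWidgetShape {

    /** Describes how a field's widgets (visualizations) relate to the field object. */
    static String describe(PDField field) {
        List<PDAnnotationWidget> widgets = field.getWidgets();
        if (widgets.isEmpty()) {
            // e.g. a non-terminal field, which only groups other fields
            return "no widgets";
        }
        // Merged case: the single widget wraps the very same dictionary as the field.
        if (widgets.size() == 1 && widgets.get(0).getCOSObject() == field.getCOSObject()) {
            return "merged";
        }
        return widgets.size() + " separate widget(s)";
    }
}
```

Whether a widget is really shown on a page still has to be checked against the page annotations.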

PDFBox 1.8.x

Unfortunately, PDFBox in its PDAcroForm and PDField objects represents only this object structure and does not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection.

The following code should make clear how to do that:

@SuppressWarnings("unchecked")
public void printFormFields(PDDocument pdfDoc) throws IOException {
    PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();

    List<PDPage> pages = docCatalog.getAllPages();
    Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
    for (int i = 0; i < pages.size(); i++) {
        PDPage page = pages.get(i);
        for (PDAnnotation annotation : page.getAnnotations())
            pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
    }

    PDAcroForm acroForm = docCatalog.getAcroForm();

    for (PDField field : (List<PDField>)acroForm.getFields()) {
        COSDictionary fieldDict = field.getDictionary();

        List<Integer> annotationPages = new ArrayList<Integer>();
        List<COSObjectable> kids = field.getKids();
        if (kids != null) {
            for (COSObjectable kid : kids) {
                COSBase kidObject = kid.getCOSObject();
                if (kidObject instanceof COSDictionary)
                    annotationPages.add(pageNrByAnnotDict.get(kidObject));
            }
        }

        Integer mergedPage = pageNrByAnnotDict.get(fieldDict);

        if (mergedPage == null)
            if (annotationPages.isEmpty())
                System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
            else
                System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
        else
            if (annotationPages.isEmpty())
                System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
            else
                System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
    }
}

Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:

  1. The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.

  2. Some PDFs in the wild, foremost older ones, do not contain the field tree but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.
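One possible workaround for the second shortcoming is to ignore the field tree and collect the widgets from the pages directly. The following is only a sketch, written against the PDFBox 2.0.x API rather than 1.8.x (in 1.8.x the corresponding calls are getAllPages() and getDictionary()). The names PageWidgetScan, listWidgetsByPage and qualifiedName are hypothetical, and the name is rebuilt by hand from the /T entries along the /Parent chain.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;

// Hypothetical helper, not part of PDFBox.
class PageWidgetScan {

    /** Returns one "pageIndex: fieldName" entry (0-based index) per widget found on the pages. */
    static List<String> listWidgetsByPage(PDDocument document) throws IOException {
        List<String> result = new ArrayList<>();
        int pageIndex = 0;
        for (PDPage page : document.getPages()) {
            for (PDAnnotation annotation : page.getAnnotations()) {
                if (annotation instanceof PDAnnotationWidget) {
                    result.add(pageIndex + ": " + qualifiedName(annotation.getCOSObject()));
                }
            }
            pageIndex++;
        }
        return result;
    }

    /** Joins the partial names (/T) found along the /Parent chain with dots. */
    static String qualifiedName(COSDictionary dict) {
        String name = null;
        COSBase current = dict;
        int guard = 0; // protects against cyclic /Parent references in broken files
        while (current instanceof COSDictionary && guard++ < 100) {
            COSDictionary node = (COSDictionary) current;
            String partial = node.getString(COSName.T);
            if (partial != null) {
                name = (name == null) ? partial : partial + "." + name;
            }
            current = node.getDictionaryObject(COSName.PARENT);
        }
        return name != null ? name : "(unnamed)";
    }
}
```

Because this walks the pages, it also finds widgets whose fields are missing from the AcroForm field list.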

PS: @mikhailvs in his answer correctly shows that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. While fast, this getPage() method has a drawback: it retrieves the page reference from an optional entry of the widget annotation dictionary. Thus, if the PDF you process has been created by software that fills that entry, all is well, but if the PDF creator has not filled that entry, all you get is a null page.

PDFBox 2.0.x

In 2.0.x some methods for accessing the elements in question have changed, but not the situation as a whole: to safely retrieve the page of a widget you still have to iterate through the pages and find a page that references the annotation.

The safe method:

int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
{
    COSDictionary widgetObject = widget.getCOSObject();
    PDPageTree pages = document.getPages();
    for (int i = 0; i < pages.getCount(); i++)
    {
        for (PDAnnotation annotation : pages.get(i).getAnnotations())
        {
            COSDictionary annotationObject = annotation.getCOSObject();
            if (annotationObject.equals(widgetObject))
                return i;
        }
    }
    return -1;
}

The fast method:

int determineFast(PDDocument document, PDAnnotationWidget widget)
{
    PDPage page = widget.getPage();
    return page != null ? document.getPages().indexOf(page) : -1;
}

Usage:

PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
    for (PDField field : acroForm.getFieldTree())
    {
        System.out.println(field.getFullyQualifiedName());
        for (PDAnnotationWidget widget : field.getWidgets())
        {
            System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
            System.out.printf(" - fast: %s", determineFast(document, widget));
            System.out.printf(" - safe: %s\n", determineSafe(document, widget));
        }
    }
}

(DetermineWidgetPage.java)

(In contrast to the 1.8.x code, the safe method here simply searches for the page of a single widget. If your code has to determine the pages of many widgets, you should create a lookup Map as in the 1.8.x case.)
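Such a lookup could be sketched for 2.0.x roughly as follows. WidgetPageLookup, buildPageIndex and pageOf are hypothetical names chosen for this example. An IdentityHashMap is used instead of the HashMap from the 1.8.x code so that the lookup does not depend on how COSDictionary implements equals and hashCode; that is a design choice of this sketch.

```java
import java.io.IOException;
import java.util.IdentityHashMap;
import java.util.Map;

import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;

// Hypothetical helper, not part of PDFBox.
class WidgetPageLookup {

    /** Maps every annotation dictionary to the 0-based index of the page that lists it. */
    static Map<COSDictionary, Integer> buildPageIndex(PDDocument document) throws IOException {
        Map<COSDictionary, Integer> pageByAnnotation = new IdentityHashMap<>();
        int pageIndex = 0;
        for (PDPage page : document.getPages()) {
            for (PDAnnotation annotation : page.getAnnotations()) {
                pageByAnnotation.put(annotation.getCOSObject(), pageIndex);
            }
            pageIndex++;
        }
        return pageByAnnotation;
    }

    /** Returns the 0-based page index of the widget, or -1 if no page references it. */
    static int pageOf(Map<COSDictionary, Integer> pageByAnnotation, PDAnnotationWidget widget) {
        Integer pageIndex = pageByAnnotation.get(widget.getCOSObject());
        return pageIndex != null ? pageIndex : -1;
    }
}
```

Build the map once per document and then call pageOf for each widget returned by field.getWidgets(); this replaces one full page scan per widget with a single scan.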

Example documents

A document for which the fast method fails: aFieldTwice.pdf

A document for which the fast method works: test_duplicate_field2.pdf

Granted, this answer may not help the OP (a year later), but if someone else runs into it, here is the solution:

PDDocumentCatalog catalog = doc.getDocumentCatalog();

int pageNumber = catalog.getAllPages().indexOf(yourField.getWidget().getPage());

This example uses Lucee (CFML): https://lucee.org/

A big thank you to mkl, as his answer above is invaluable and I couldn't have built this function without his help.

Call the function pageForSignature(doc, fieldName) and it will return the number of the page that the field resides on. It returns -1 if fieldName is not found.

  <cfscript>
  try{

  /*
  java is used by using CreateObject()
  */

  variables.File = CreateObject("java", "java.io.File");

  //references lucee bundle directory - typically on tomcat: /usr/local/tomcat/lucee-server/bundles
  variables.PDDocument = CreateObject("java", "org.apache.pdfbox.pdmodel.PDDocument", "org.apache.pdfbox.app", "2.0.18");

  function determineSafe(doc, widget){

    var i = '';
    var widgetObject = widget.getCOSObject();
    var pages = doc.getPages();
    var annotation = '';
    var annotationObject = '';

    for (i = 0; i < pages.getCount(); i=i+1){

    for (annotation in pages.get(i).getAnnotations()){
        if(annotation.getSubtype() eq 'widget'){
            annotationObject = annotation.getCOSObject();
            if (annotationObject.equals(widgetObject)){
                return i;
            }
        }
    }

    }
    return -1;
  }

  function pageForSignature(doc, fieldName){
    try{
    var acroForm = doc.getDocumentCatalog().getAcroForm();
    var field = '';
    var widget = '';
    var annotation = '';
    var pageNo = '';

    for(field in acroForm.getFields()){

    if(field.getPartialName() == fieldName){

        for(widget in field.getWidgets()){
           // determineSafe() scans the page annotations itself, so it also
           // works when widget.getPage() returns null (no page entry in the widget)
           pageNo = determineSafe(doc, widget);
           doc.close();
           return pageNo;
        }
    }
  }
return -1;  
}catch(e){
    doc.close();
writeDump(label="catch error",var='#e#');
  }
} 

doc = PDDocument.init().load(File.init('/**********/myfile.pdf'));

//returns the page number; page numbers start at 0
pageNo = pageForSignature(doc, 'twtzceuxvx');

writeDump(label="pageForSignature(doc, fieldName)", var="#pageNo#");
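  }catch(any e){
    // closes the try block opened at the top of the script
    writeDump(label="catch error", var="#e#");
  }
  </cfscript>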

General solution for a field with a single widget or multiple widgets (e.g. a duplicate qualified name on a single page):

List<PDAnnotationWidget> widgets = field.getWidgets();
PDDocumentCatalog catalog = doc.getDocumentCatalog();
for (int i = 0; i < widgets.size(); i++) {
    // 1-based page number; getPage() relies on an optional entry and may return null
    int pageNumber = 1 + catalog.getPages().indexOf(widgets.get(i).getPage());

    /* The field coordinates are also available here; this works for single and multiple widgets. */
    // PDRectangle r = widgets.get(i).getRectangle();
}
