Extracting page numbers from a PDF file

Question

I have a PDF document that may have been created by extracting a few pages from another PDF document. How do I get the page number? The starting page number is 572, whereas for a complete PDF document it should have been 1.

Do you think converting the PDF into XML would solve this issue?

Answer 1

Most probably the document contains a /PageLabels entry in the Document Catalog. This entry specifies the numbering style for page numbers as well as the starting number.
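To make the effect of the starting number concrete, here is a minimal sketch in plain Java (no PDF library involved). It assumes decimal numbering only and models each labelling range as a pair of "first page index" and "/St value"; the class name PageLabelSketch and the method labelFor are hypothetical names chosen for illustration, not part of any library.

```java
import java.util.Map;
import java.util.TreeMap;

public class PageLabelSketch {
    // Returns the decimal label shown for a zero-based page index.
    // 'ranges' maps the first page index of each labelling range to its start value (/St).
    public static int labelFor(TreeMap<Integer, Integer> ranges, int pageIndex) {
        // Find the range that begins at or before this page.
        Map.Entry<Integer, Integer> range = ranges.floorEntry(pageIndex);
        if (range == null) {
            // Assumption for this sketch: pages without a range fall back to 1-based numbering.
            return pageIndex + 1;
        }
        // Label = start value of the range + offset of the page within the range.
        return range.getValue() + (pageIndex - range.getKey());
    }

    public static void main(String[] args) {
        // One range starting at the first page (index 0) with start value 572,
        // as in the extracted document described in the question.
        TreeMap<Integer, Integer> ranges = new TreeMap<>();
        ranges.put(0, 572);
        for (int i = 0; i < 3; i++) {
            System.out.println("page index " + i + " -> label " + labelFor(ranges, i));
        }
    }
}
```

With a single range starting at 572, the first three pages are labelled 572, 573 and 574, which matches the symptom in the question: the physical first page carries the label it had in the original document.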

You might have to update the starting number or remove the entry completely. The following document contains more information about the /PageLabels entry:

Specifying consistent page numbering for PDF documents

Example 2 in that document might be useful if you decide to update the entry.

Answer 2

Finally figured it out using iText. It would not have been possible without Bovrosky's hint. Tons of thanks to him. Posting the code sample:

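import com.itextpdf.text.pdf.*;   // iText 5
import java.util.ListIterator;

public void process(PdfReader reader) {
    // /PageLabels is an entry of the document catalog (the PDF root dictionary).
    PdfDictionary dict = reader.getCatalog();
    // getPdfObject resolves the value whether it is direct or an indirect reference.
    PdfObject ref = PdfReader.getPdfObject(dict.get(com.itextpdf.text.pdf.PdfName.PAGELABELS));
    if (!(ref instanceof PdfDictionary)) return; // no page labels in this document
    // /Nums alternates page indices and label dictionaries.
    PdfArray array = (PdfArray) PdfReader.getPdfObject(((PdfDictionary) ref).get(com.itextpdf.text.pdf.PdfName.NUMS));
    System.out.println("Start Page: " + resolvePdfIndirectReference(array, reader));
}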

private static int resolvePdfIndirectReference(PdfObject obj, PdfReader reader) {
    if (obj instanceof PdfArray) {
        PdfDictionary subDict = null;
        PdfIndirectReference indRef = null;
        ListIterator<PdfObject> itr = ((PdfArray) obj).listIterator();
        while (itr.hasNext()) {
            PdfObject pdfObj = itr.next();
            if (pdfObj instanceof PdfIndirectReference)
                indRef = (PdfIndirectReference) pdfObj;
            if (pdfObj instanceof PdfDictionary) {
                subDict = (PdfDictionary) pdfObj;
                break;
            }
        }
        if (subDict != null) {
            return resolvePdfIndirectReference(subDict, reader);
        } else if (indRef != null)
            return resolvePdfIndirectReference(indRef, reader);
    } else if (obj instanceof PdfIndirectReference) {
        PdfObject ref = reader.getPdfObject(((PdfIndirectReference) obj).getNumber());
        return resolvePdfIndirectReference(ref, reader);
    } else if (obj instanceof PdfDictionary) {
        PdfNumber num = ((PdfDictionary) obj).getAsNumber(com.itextpdf.text.pdf.PdfName.ST);
        // /St is optional; the PDF specification gives it a default value of 1.
        return num != null ? num.intValue() : 1;
    }
    return 0;
}


Answer 1: 2013-05-31 11:21:44

Answer 2: 2013-05-31 19:58:08
