How do I walk through the tree of PDF objects in PDFSharp?

I am trying to walk through the tree of PdfItem objects in an existing PDF document using PDFSharp in C#.

I want to create a hierarchy of all the objects as I go along, similar to what the "PDF Explorer" example does, but I want it to be a tree instead of a flat list of all the objects.

The root node is document.Internals.Catalog, and I want to walk down through all of document.Internals.Catalog.Elements until I have visited every element.

One of the problems I run into is that there are circular references in the tree, and I can't figure out how to detect them.

Are there any code samples out there?

This post by marihanzo on the PDFSharp forums has worked for us:

http://forum.pdfsharp.net/viewtopic.php?f=2&t=527&p=1603

The only issue we've had was handling fields with \r\n in them. Here is a copy of the code in case the forum post gets lost.

PDFParser.cs

using System.Text;

public class PDFParser
{
    /// BT = Beginning of a text object operator
    /// ET = End of a text object operator
    /// Td move to the start of next line
    ///  5 Ts = superscript
    /// -5 Ts = subscript

    #region Fields

    #region _numberOfCharsToKeep
    /// <summary>
    /// The number of characters to keep, when extracting text.
    /// </summary>
    private static int _numberOfCharsToKeep = 15;
    #endregion

    #endregion

    #region ExtractTextFromPDFBytes
    /// <summary>
    /// This method processes an uncompressed Adobe (text) object
    /// and extracts text.
    /// </summary>
    /// <param name="input">uncompressed</param>
    /// <returns>The extracted text, or an empty string on failure.</returns>
    public string ExtractTextFromPDFBytes(byte[] input)
    {
        if (input == null || input.Length == 0) return "";

        try
        {
            StringBuilder resultString = new StringBuilder();

            // Flag showing whether we are currently inside a text object
            bool inTextObject = false;

            // Flag showing if the next character is literal
            // e.g. '\\' to get a '\' character or '\(' to get '('
            bool nextLiteral = false;

            // () Bracket nesting level. Text appears inside ()
            int bracketDepth = 0;

            // Keep previous chars so that operators can be recognized:
            char[] previousCharacters = new char[_numberOfCharsToKeep];
            for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';

            for (int i = 0; i < input.Length; i++)
            {
                char c = (char)input[i];

                if (inTextObject)
                {
                    // Position the text
                    if (bracketDepth == 0)
                    {
                        if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                            resultString.Append("\n\r");
                        else if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                            resultString.Append("\n");
                        else if (CheckToken(new string[] { "Tj" }, previousCharacters))
                            resultString.Append(" ");
                    }

                    // End of a text object, also go to a new line.
                    if (bracketDepth == 0 &&
                        CheckToken(new string[] { "ET" }, previousCharacters))
                    {
                        inTextObject = false;
                        resultString.Append(" ");
                    }
                    // Start outputting text
                    else if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                    {
                        bracketDepth = 1;
                    }
                    // Stop outputting text
                    else if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                    {
                        bracketDepth = 0;
                    }
                    // Just a normal text character:
                    else if (bracketDepth == 1)
                    {
                        // A backslash escapes the next character:
                        // output that character as-is, do not interpret it.
                        if (c == '\\' && !nextLiteral)
                        {
                            nextLiteral = true;
                        }
                        else
                        {
                            if (((c >= ' ') && (c <= '~')) ||
                                ((c >= 128) && (c < 255)))
                            {
                                resultString.Append(c);
                            }

                            nextLiteral = false;
                        }
                    }
                }

                // Shift the buffer of recent characters and append the
                // current one, so that CheckToken can look back at it
                for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                    previousCharacters[j] = previousCharacters[j + 1];
                previousCharacters[_numberOfCharsToKeep - 1] = c;

                // Start of a text object
                if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                    inTextObject = true;
            }
            return resultString.ToString();
        }
        catch
        {
            return "";
        }
    }
    #endregion

    #region CheckToken
    /// <summary>
    /// Check if a certain 2 character token just came along (e.g. BT)
    /// </summary>
    /// <param name="tokens">the searched tokens</param>
    /// <param name="recent">the recent character array</param>
    /// <returns>true if one of the tokens was just read, delimited by whitespace</returns>
    private bool CheckToken(string[] tokens, char[] recent)
    {
        foreach (string token in tokens)
        {
            // This check only handles two-character tokens. Skip shorter ones
            // instead of returning early, so the remaining tokens are still tested.
            if (token.Length < 2)
                continue;

            if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                (recent[_numberOfCharsToKeep - 4] == 0x0a)))
            {
                return true;
            }
        }

        return false;
    }
    #endregion
}

And the calling code:

    public override String ExtractText()
    {
        String outputText = "";
        try
        {
            PdfDocument inputDocument = PdfReader.Open(this._sDirectory + this._sFileName, PdfDocumentOpenMode.ReadOnly);

            // Run the parser over every content stream of every page
            foreach (PdfPage page in inputDocument.Pages)
            {
                for (int index = 0; index < page.Contents.Elements.Count; index++)
                {
                    PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(index).Stream;
                    outputText += new PDFParser().ExtractTextFromPDFBytes(stream.Value);
                }
            }
        }
        catch (Exception e)
        {
            // PDF_ParseException is the poster's own exception type.
            // As posted, it is constructed here but not thrown.
            PDF_ParseException oEx = new PDF_ParseException(this, e);
        }
        return outputText;
    }

Read and analyze the entirety of the collection, and build an in-memory tree of your own. Then walk that tree.
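
A rough sketch of that idea, assuming PDFsharp 1.x: start at document.Internals.Catalog, descend into every PdfDictionary and PdfArray, and follow each PdfReference only the first time its object ID is seen. Remembering the IDs of visited indirect objects is what stops the circular references (for example a page's /Parent entry pointing back at the page tree). PdfTreeNode and PdfTreeBuilder are hypothetical names made up for this example and are not part of PDFsharp. An explicit stack is used instead of recursion so that long chains of objects do not overflow the call stack.

```csharp
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;

// Hypothetical node type for our own in-memory tree (not a PDFsharp class).
public sealed class PdfTreeNode
{
    public readonly string Label;
    public readonly PdfItem Item; // null for back-references that were not followed
    public readonly List<PdfTreeNode> Children = new List<PdfTreeNode>();

    public PdfTreeNode(string label, PdfItem item)
    {
        Label = label;
        Item = item;
    }
}

// Hypothetical helper that builds the tree starting at the catalog.
public static class PdfTreeBuilder
{
    public static PdfTreeNode Build(PdfDocument document)
    {
        PdfDictionary catalog = document.Internals.Catalog;
        PdfTreeNode root = new PdfTreeNode("/Catalog", catalog);

        // IDs ("objectNumber generation") of indirect objects already expanded.
        HashSet<string> visited = new HashSet<string>();
        if (catalog.Reference != null)
            visited.Add(catalog.Reference.ObjectID.ToString());

        Stack<PdfTreeNode> pending = new Stack<PdfTreeNode>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            PdfTreeNode node = pending.Pop();
            foreach (KeyValuePair<string, PdfItem> child in ChildrenOf(node.Item))
            {
                string label = child.Key;
                PdfItem item = child.Value;

                PdfReference reference = item as PdfReference;
                if (reference != null)
                {
                    string id = reference.ObjectID.ToString();
                    label += " -> " + id + " R";

                    // HashSet.Add returns false if the ID was seen before:
                    // record a leaf instead of descending again.
                    if (!visited.Add(id))
                    {
                        node.Children.Add(new PdfTreeNode(label + " (already visited)", null));
                        continue;
                    }
                    item = reference.Value; // the referenced object
                }

                PdfTreeNode childNode = new PdfTreeNode(label, item);
                node.Children.Add(childNode);
                pending.Push(childNode);
            }
        }
        return root;
    }

    // Yields the direct children of a dictionary or an array; nothing for other items.
    private static IEnumerable<KeyValuePair<string, PdfItem>> ChildrenOf(PdfItem item)
    {
        PdfDictionary dictionary = item as PdfDictionary;
        if (dictionary != null)
        {
            foreach (KeyValuePair<string, PdfItem> entry in dictionary.Elements)
                yield return entry;
        }

        PdfArray array = item as PdfArray;
        if (array != null)
        {
            for (int i = 0; i < array.Elements.Count; i++)
                yield return new KeyValuePair<string, PdfItem>("[" + i + "]", array.Elements[i]);
        }
    }
}
```

Something like PdfTreeBuilder.Build(PdfReader.Open("input.pdf", PdfDocumentOpenMode.ReadOnly)) would then return the root node, where input.pdf is a placeholder path and PdfReader comes from the PdfSharp.Pdf.IO namespace. Note that the result is a spanning tree of the object graph: each indirect object is expanded under whichever parent reaches it first, and every later reference to it shows up as an "already visited" leaf.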
