
How to get a particular paragraph in a PDF file using iTextSharp in C#?

I am using iTextSharp in my C# WinForms application. I want to get a particular paragraph from a PDF file. Is this possible in iTextSharp?

Yes and no.

First the no. The PDF format doesn't have a concept of text structures such as paragraphs, sentences or even words; it just has runs of text. The fact that two runs of text are near each other, so that we think of them as structured, is a human thing. When you see something that looks like a three-line paragraph in a PDF, in reality the program that generated the PDF did the job of chopping the text up into three unrelated lines and then drew each line at specific x,y coordinates. Even worse, depending on what the designer wants, each line of text could be composed of smaller runs that could be words or even just characters. So it might be draw "the cat in the hat" at 10,10, or it might be draw "t" at 10,10, then draw "h" at 14,10, then draw "e" at 18,10 and so on. This is actually pretty common with PDFs produced by design-heavy programs like Adobe InDesign.

Now the yes. Actually, it's a maybe. If you are willing to put in a little work you might be able to get iTextSharp to do what you are looking for. There is a class called PdfTextExtractor that has a method called GetTextFromPage that will get all of the raw text from a page. The last parameter to this method is an object that implements the ITextExtractionStrategy interface. If you create your own class that implements this interface, you can process each run of text and apply your own logic.

In this interface there's a method called RenderText which gets called for every run of text. You'll be given an iTextSharp.text.pdf.parser.TextRenderInfo object from which you can get the raw text of the run as well as other things such as the coordinates it starts at, the current font, etc. Since a visual line of text can be composed of multiple runs, you can use this method to compare the run's baseline (the y coordinate of its starting point) to that of the previous run to determine whether it is part of the same visual line.

Below is an example of an implementation of that interface:

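    using System;
    using System.Collections.Generic;
    using System.Text;
    using iTextSharp.text.pdf.parser; //Vector, TextRenderInfo, ImageRenderInfo

    public class TextAsParagraphsExtractionStrategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy {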
        //Text buffer
        private StringBuilder result = new StringBuilder();

        //Store last used properties
        private Vector lastBaseLine;

        //Buffer of lines of text and their Y coordinates. NOTE, these should be exposed as properties instead of fields but are left as is for simplicity's sake
        public List<string> strings = new List<String>();
        public List<float> baselines = new List<float>();

        //This is called whenever a run of text is encountered
        public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
            //This code assumes that if the baseline changes then we're on a newline
            Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();

            //See if the baseline has changed
            if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2])) {
                //See if we have text and not just whitespace
                if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                    //Mark the previous line as done by adding it to our buffers
                    this.baselines.Add(this.lastBaseLine[Vector.I2]);
                    this.strings.Add(this.result.ToString());
                }
                //Reset our "line" buffer
                this.result.Clear();
            }

            //Append the current text to our line buffer
            this.result.Append(renderInfo.GetText());

            //Reset the last used line
            this.lastBaseLine = curBaseline;
        }

        public string GetResultantText() {
            //One last time, see if there's anything left in the buffer
            if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                this.baselines.Add(this.lastBaseLine[Vector.I2]);
                this.strings.Add(this.result.ToString());
            }
            //We're not going to use this method to return a string. Instead, callers should inspect this class's strings and baselines fields afterwards.
            return null;
        }

        //Not needed, part of interface contract
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo) { }
    }
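
The strategy above treats any difference at all between two baseline y values as a new line. If a PDF places runs of the same visual line at very slightly different y values, that exact comparison would split the line. A minimal sketch of a more forgiving check, assuming a small tolerance is acceptable for your documents, might look like the following. `BaselineComparer`, `SameLine` and the 0.5 point default are hypothetical names and values chosen for illustration; they are not part of iTextSharp.

```csharp
using System;

public static class BaselineComparer {
    // Returns true when two baseline y coordinates are close enough to be
    // treated as the same visual line. The default tolerance of 0.5 points
    // is an arbitrary assumption and should be tuned per document.
    public static bool SameLine(float y1, float y2, float tolerance = 0.5f) {
        return Math.Abs(y1 - y2) <= tolerance;
    }
}
```

Inside RenderText, the inequality test on `Vector.I2` could then be swapped for `!BaselineComparer.SameLine(curBaseline[Vector.I2], lastBaseLine[Vector.I2])`.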

To call it we'd do:

        PdfReader reader = new PdfReader(workingFile);
        TextAsParagraphsExtractionStrategy S = new TextAsParagraphsExtractionStrategy();
        iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
        for (int i = 0; i < S.strings.Count; i++) {
            Console.WriteLine("Line {0,-5}: {1}", S.baselines[i], S.strings[i]);
        }

We're actually throwing away the return value of GetTextFromPage and instead inspecting the strategy's baselines and strings list fields. The next step would be to compare the baselines and try to determine how to group lines together into paragraphs.
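
One possible way to do that grouping is sketched below, under the assumption that paragraphs are separated by a larger vertical gap than the lines within them. `ParagraphGrouper`, `Group` and `maxLineGap` are hypothetical names introduced for illustration, and the threshold has to be chosen per document. The sketch takes the `strings` and `baselines` lists collected by the strategy and needs nothing from iTextSharp itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ParagraphGrouper {
    // Groups extracted lines into paragraphs. A new paragraph starts whenever
    // the vertical distance to the previous baseline exceeds maxLineGap
    // (in points). maxLineGap is an assumed tuning value, not an iTextSharp setting.
    public static List<string> Group(IList<string> strings, IList<float> baselines, float maxLineGap) {
        if (strings.Count != baselines.Count) {
            throw new ArgumentException("strings and baselines must have the same length");
        }

        var paragraphs = new List<string>();
        var current = new StringBuilder();

        for (int i = 0; i < strings.Count; i++) {
            if (i > 0) {
                // Math.Abs keeps the check independent of whether y grows up or down
                float gap = Math.Abs(baselines[i - 1] - baselines[i]);
                if (gap > maxLineGap) {
                    // Gap too large: close the current paragraph and start a new one
                    paragraphs.Add(current.ToString());
                    current.Clear();
                } else {
                    // Same paragraph: join the lines with a single space
                    current.Append(' ');
                }
            }
            current.Append(strings[i].Trim());
        }

        // Flush whatever is left in the buffer
        if (current.Length > 0) {
            paragraphs.Add(current.ToString());
        }
        return paragraphs;
    }
}
```

With the calling code above, this could be used as `List<string> paragraphs = ParagraphGrouper.Group(S.strings, S.baselines, 20f);`, where the value 20 is only an example threshold.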

I should note that not all paragraphs have spacing that's different from individual lines of text. For instance, if you run the PDF created below through the code above, you'll see that every line of text is 18 points away from the next, regardless of whether the line starts a new paragraph or not. If you open the PDF it creates in Acrobat and cover everything but the first letter of each line, you'll see that your eye can't even tell the difference between a line break and a paragraph break.

        using (FileStream fs = new FileStream(workingFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
            using (Document doc = new Document(PageSize.LETTER)) {
                using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
                    doc.Open();
                    doc.Add(new Paragraph("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna."));
                    doc.Add(new Paragraph("This"));
                    doc.Add(new Paragraph("Is"));
                    doc.Add(new Paragraph("A"));
                    doc.Add(new Paragraph("Test"));
                    doc.Close();
                }
            }
        }


 