Extract text from PDF by format

I am trying to extract the headlines from PDFs. So far I have tried two things: reading the plain text and taking the first line (which didn't work, because in the plain text the headlines were not at the beginning), and reading the text from a fixed region (which didn't work either, because the regions are not always the same).

In my opinion, the easiest way to do this would be to read only text with a particular format (font, font size, etc.). Is there a way to do this?

You can enumerate all text objects on a PDF page using the Docotic.Pdf library. For each text object, information about its font and size is available. Below is a sample:

using System;
using BitMiracle.Docotic.Pdf;

public static void listTextObjects(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        // {0} = text, {1} = font name, {2} = height, {3} = position
        string format = "{0}\n{1}, {2}px at {3}";

        foreach (PdfPage page in pdf.Pages)
        {
            foreach (PdfPageObject obj in page.GetObjects())
            {
                // Skip images, paths and other non-text objects.
                if (obj.Type != PdfPageObjectType.Text)
                    continue;

                PdfTextData text = (PdfTextData)obj;

                string message = string.Format(format, text.Text, text.Font.Name,
                    text.Size.Height, text.Position);
                Console.WriteLine(message);
            }
        }
    }
}

For each text object on each page of the input PDF file, the code outputs lines like the following:

FACTUUR
Helvetica-BoldOblique, 19.04px at { X=51.12; Y=45.54 }

You can use the retrieved information to find the largest text, bold text, or text with whatever other properties are used to format the headline.
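As a rough sketch of that selection step, assume the values printed by the sample have first been collected into a list. `TextRun` and `HeadlinePicker` are hypothetical names used for illustration only; they are not part of Docotic.Pdf. The heuristic here (tallest text wins, bold font names break ties) is just one possible choice and may need tuning for your documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for the values printed by the sample
// (text, font name, height); not a Docotic.Pdf type.
public sealed class TextRun
{
    public TextRun(string text, string fontName, double height)
    {
        Text = text;
        FontName = fontName;
        Height = height;
    }

    public string Text { get; }
    public string FontName { get; }
    public double Height { get; }
}

public static class HeadlinePicker
{
    // Returns the tallest non-blank run; when heights are equal, a run whose
    // font name contains "Bold" is preferred. Returns null if nothing qualifies.
    public static TextRun PickHeadline(IEnumerable<TextRun> runs)
    {
        return runs
            .Where(r => !string.IsNullOrWhiteSpace(r.Text))
            .OrderByDescending(r => r.Height)
            .ThenByDescending(r => r.FontName.IndexOf("Bold", StringComparison.OrdinalIgnoreCase) >= 0)
            .FirstOrDefault();
    }
}
```

To use it, build one `TextRun` per text object inside the loop (from `text.Text`, `text.Font.Name` and `text.Size.Height`) and call `HeadlinePicker.PickHeadline` once per page.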

If the headline is guaranteed to be the topmost text on a page, then you can use an even simpler approach:

public static void printText(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            string text = page.GetTextWithFormatting();
            Console.WriteLine(text);
        }
    }
}

The GetTextWithFormatting method returns text in reading order (i.e. from the top left to the bottom right of the page).
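If the headline really is the topmost text, it should then be the first non-blank line of the returned string. Here is a minimal sketch, assuming the string separates lines with ordinary line breaks. `PageText.FirstNonEmptyLine` is a hypothetical helper, not a library method.

```csharp
using System;

public static class PageText
{
    // Returns the first line that contains visible characters, trimmed,
    // or an empty string when the input has no such line.
    public static string FirstNonEmptyLine(string pageText)
    {
        if (string.IsNullOrEmpty(pageText))
            return string.Empty;

        // Split on both CR and LF so Windows and Unix line endings work.
        string[] lines = pageText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            if (!string.IsNullOrWhiteSpace(line))
                return line.Trim();
        }

        return string.Empty;
    }
}
```

In the loop above you would call `PageText.FirstNonEmptyLine(text)` instead of printing the whole page.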

Disclaimer: I am one of the developers of the library.
