How to read a PDF file line by line in C#?

In my Windows 8 application, I would like to read a PDF line by line and store the lines in a string array. How can I do it?

    public StringBuilder addd = new StringBuilder();
    string[] array;

    private async void btndosyasec_Click(object sender, RoutedEventArgs e)
    {
        FileOpenPicker openPicker = new FileOpenPicker();
        openPicker.ViewMode = PickerViewMode.List;
        openPicker.SuggestedStartLocation = PickerLocationId.PicturesLibrary;
        openPicker.FileTypeFilter.Add(".pdf");

        StorageFile file = await openPicker.PickSingleFileAsync();

        if (file != null)
        {
            PdfReader reader = new PdfReader((await file.OpenReadAsync()).AsStream());

            // One slot per page; PDF page numbers start at 1, array indexes at 0.
            array = new string[reader.NumberOfPages];

            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Extract the text once and reuse it for both the builder and the array.
                string tmp = PdfTextExtractor.GetTextFromPage(reader, page);
                addd.Append(tmp);
                array[page - 1] = tmp;
            }

            // Close the reader only after every page has been read.
            reader.Close();
        }
    }

Hi, I had this problem too. I used this code and it worked.

You will need a reference to the iTextSharp library.

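using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;
string text;
string[] words;
string line;

// PDF page numbers start at 1.
for (int i = 1; i <= intPageNum; i++)
{
    // LocationTextExtractionStrategy orders the text chunks by their position on the page.
    text = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());

    // Each element of words is one line of the current page.
    words = text.Split('\n');
    for (int j = 0, len = words.Length; j < len; j++)
    {
        // line holds the current line; process or store it here.
        line = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
    }
}

reader.Close();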

On each iteration, the words array contains the lines of the current page of the PDF file.
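
In the code above, words is reassigned for every page, so only the last page's lines remain after the loop. To end up with a single string array for the whole file, as the question asks, the lines have to be accumulated across pages. Here is a minimal sketch of that step. It works on plain strings only, and the names PdfLineHelper, SplitIntoLines and CollectLines are hypothetical helpers made up for illustration; they are not part of iTextSharp. It also assumes that blank lines are not wanted in the result.

```csharp
using System;
using System.Collections.Generic;

public static class PdfLineHelper
{
    // Splits the text of one page into lines. Accepts both "\n" and "\r\n"
    // line endings and skips lines that are empty or whitespace-only.
    public static List<string> SplitIntoLines(string pageText)
    {
        var lines = new List<string>();
        if (string.IsNullOrEmpty(pageText))
        {
            return lines;
        }

        foreach (string raw in pageText.Split('\n'))
        {
            // Remove the carriage return left over from a "\r\n" ending.
            string line = raw.TrimEnd('\r');
            if (line.Trim().Length > 0)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    // Joins the lines of all pages, in page order, into one array.
    public static string[] CollectLines(IEnumerable<string> pageTexts)
    {
        var all = new List<string>();
        foreach (string pageText in pageTexts)
        {
            all.AddRange(SplitIntoLines(pageText));
        }
        return all.ToArray();
    }
}
```

Inside the page loop you could then add each page's extracted text to a List<string> and, once the loop has finished, call PdfLineHelper.CollectLines on that list to get the final array.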

The code below works for iText 7.

using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;


public void ExtractTextFromPDF(string filePath)
{
    PdfReader pdfReader = new PdfReader(filePath);
    PdfDocument pdfDoc = new PdfDocument(pdfReader);

    for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);

        Console.WriteLine("pageContent : " + pageContent);
    }
    pdfDoc.Close();
    pdfReader.Close();
}

If you are looking for something license-free/open-source with basic text extraction from PDF, you can use PdfClown, which supports both .NET Framework and .NET Core (though the .NET Standard 2.0 build is still a beta version). For more information or samples, take a look at:

https://www.nuget.org/packages/PdfClown.NetStandard/0.2.0-beta

https://sourceforge.net/p/clown/code/HEAD/tree/trunk/dotNET/pdfclown.samples.cli/

The sample below targets .NET Core.

public class PdfClownUtil
{
    private static readonly string fileSrcPath = "MyTestDoc.pdf";
    private readonly StringBuilder stringBuilder_1 = new StringBuilder();
    public string GetPdfTextContent()
    {
        PdfDocuments.Document document = new File(fileSrcPath).Document;
        StringBuilder stringBuilder_2 = new StringBuilder();

        TextExtractor extractor = new TextExtractor();
        foreach (Page page in document.Pages)
        {
            // Approach-1: 
            Extract(new ContentScanner(page));

            // Approach-2 with additional Options: 
            IList<ITextString> textStrings = extractor.Extract(page)[TextExtractor.DefaultArea];
            foreach (ITextString textString in textStrings)
            {
                stringBuilder_2.Append(textString.Text);
            }
            stringBuilder_2.AppendLine();
        }
        var content = stringBuilder_2.ToString();
        return content;
    }

    // Approach-1: 
    private void Extract(ContentScanner level)
    {
        if (level == null)
        {
            return;
        }                

        while (level.MoveNext())
        {
            ContentObject content = level.Current;
            if (content is ShowText)
            {
                Font font = level.State.Font;
                // Extract the current text chunk, decoding it!
                this.stringBuilder_1.Append(font.Decode(((ShowText)content).Text));
            }
            else if (content is Text || content is ContainerObject)
            {
                // Scan the inner level!
                Extract(level.ChildLevel);
            }
        }
    }
}
