C# [itext7] GetTextFromPage appends each page

I am not sure what I am doing wrong here.

While looping through the pages of a PDF - I get the page content. For example:

Page 1 = 1

Page 2 = 2

Page 3 = 3

The code:

PdfReader pdfReader = new PdfReader(filename);
PdfDocument pdfDoc = new PdfDocument(pdfReader);
// The strategy is created once, before the loop, and shared by every page
var strategy = new SimpleTextExtractionStrategy();
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
    try
    {
        string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
        // do stuff with pageContent
    }
    catch (Exception)
    {
        // error handling omitted
    }
}

The output:

First loop = Page 1 = 1

Second loop = Page 1 = 1, Page 2 = 2

Third loop = Page 1 = 1, Page 2 = 2, Page 3 = 3

I moved pageContent out of the loop and added this code prior to the try statement:

pageContent = "";

I stepped through, and pageContent is "" at the start of the second loop. Yet after the call to GetTextFromPage, it contains the text of both the first and the second page.

This has occurred on a variety of PDFs, so I figure the problem is my code, not the PDF in question.

I spotted the issue - though I don't think this should be an issue...

PdfReader pdfReader = new PdfReader(filename);
PdfDocument pdfDoc = new PdfDocument(pdfReader);
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
    try
    {
        // A new strategy instance is created for each page
        var strategy = new SimpleTextExtractionStrategy();
        string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
        // do stuff with pageContent
    }
    catch (Exception)
    {
        // error handling omitted
    }
}

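The strategy has to be created inside the loop (here, inside the try block), so that each page gets its own instance. The strategy object appears to keep the text it has already extracted, so reusing one instance returns every page processed so far. Once a new strategy is created per page, GetTextFromPage returns just the requested page and does not append the earlier ones.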
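To illustrate why creating the strategy per page matters, here is a minimal sketch that runs without iText. AccumulatingStrategy and ExtractPage are hypothetical stand-ins, not iText classes or methods. They only mimic the behaviour I observed, on the assumption that the strategy appends each page's text to an internal buffer and returns the whole buffer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for an extraction strategy; NOT iText's class.
// Assumption: extracted text is appended to an internal buffer that is never cleared.
public class AccumulatingStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    public void Append(string text) => buffer.Append(text);

    public string GetResultantText() => buffer.ToString();
}

public static class Demo
{
    // Hypothetical stand-in for PdfTextExtractor.GetTextFromPage:
    // feeds the page text to the strategy and returns everything it holds.
    public static string ExtractPage(string pageText, AccumulatingStrategy strategy)
    {
        strategy.Append(pageText);
        return strategy.GetResultantText();
    }

    // One shared strategy: each result also contains the earlier pages.
    public static List<string> ReuseStrategy(string[] pages)
    {
        var results = new List<string>();
        var shared = new AccumulatingStrategy();
        foreach (var pageText in pages)
        {
            results.Add(ExtractPage(pageText, shared));
        }
        return results;
    }

    // A new strategy per page: each result contains only that page.
    public static List<string> FreshStrategy(string[] pages)
    {
        var results = new List<string>();
        foreach (var pageText in pages)
        {
            results.Add(ExtractPage(pageText, new AccumulatingStrategy()));
        }
        return results;
    }

    public static void Main()
    {
        var pages = new[] { "1", "2", "3" };
        Console.WriteLine(string.Join(" | ", ReuseStrategy(pages))); // 1 | 12 | 123
        Console.WriteLine(string.Join(" | ", FreshStrategy(pages))); // 1 | 2 | 3
    }
}
```

The shared instance reproduces the appended output from my question, while the per-page instance matches the corrected loop.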
