C# [itext7] GetTextFromPage appends each page

Question

I am not sure what I am doing wrong here.

While looping through the pages of a PDF - I get the page content. For example:

Page 1 = 1

Page 2 = 2

Page 3 = 3

The code:

PdfReader pdfReader = new PdfReader(filename);
PdfDocument pdfDoc = new PdfDocument(pdfReader);
var strategy = new SimpleTextExtractionStrategy();
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
    try
    {
        string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
        // do stuff with pageContent
    }
    catch (Exception)
    {
        // handle or log the error
    }
}

The output:

First loop = Page 1 = 1

Second loop = Page 1 = 1, Page 2 = 2

Third loop = Page 1 = 1, Page 2 = 2, Page 3 = 3

I moved pageContent out of the loop and added this code prior to the try statement:

pageContent = "";

I stepped through, and pageContent is "" at the start of the second iteration. Yet after GetTextFromPage returns, it contains the text of both the first and second pages.

This has occurred on a variety of PDFs, so I figure the problem is my code and not the PDF in question.

Answer 1

I spotted the issue - though I don't think this should be an issue...

PdfReader pdfReader = new PdfReader(filename);
PdfDocument pdfDoc = new PdfDocument(pdfReader);
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
    try
    {
        // A new strategy for each page, so no text carries over from earlier pages
        var strategy = new SimpleTextExtractionStrategy();
        string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
        // do stuff with pageContent
    }
    catch (Exception)
    {
        // handle or log the error
    }
}

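The strategy has to be created inside the loop (here, inside the try block), so that each page gets a new instance. SimpleTextExtractionStrategy keeps the text it has already collected, so reusing one instance across pages returns the text of every page processed so far. With a fresh instance per page, GetTextFromPage returns just the requested page and does not append the earlier ones.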
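
For reference, here is a minimal sketch of the same fix wrapped in a helper that returns one string per page. The class name PageTextExample and the method ExtractPages are hypothetical names chosen for illustration; the iText calls are the ones already used above. It assumes the itext7 NuGet package is referenced.

```csharp
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PageTextExample
{
    // Hypothetical helper: returns the text of each page as a separate entry.
    public static List<string> ExtractPages(string filename)
    {
        var pages = new List<string>();
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(filename));
        try
        {
            int pageCount = pdfDoc.GetNumberOfPages();
            for (int page = 1; page <= pageCount; page++)
            {
                // New strategy per page: the strategy object holds the text
                // it has seen, so sharing one would accumulate earlier pages.
                var strategy = new SimpleTextExtractionStrategy();
                string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
                pages.Add(pageContent);
            }
        }
        finally
        {
            // Release the underlying file even if extraction throws.
            pdfDoc.Close();
        }
        return pages;
    }
}
```

Calling PageTextExample.ExtractPages(filename) should then give a list whose first entry holds only page 1, the second only page 2, and so on. As far as I know there is also a GetTextFromPage overload that takes only the page and creates its own strategy on each call, which avoids the shared state as well, though it may order the text differently from SimpleTextExtractionStrategy.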
