I am not sure what I am doing wrong here.
While looping through the pages of a PDF, I extract the text of each page. For example, suppose the pages contain:
Page 1 = 1
Page 2 = 2
Page 3 = 3
The code:
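using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

PdfReader pdfReader = new PdfReader(filename);
PdfDocument pdfDoc = new PdfDocument(pdfReader);
// One strategy instance, created once before the loop
var strategy = new SimpleTextExtractionStrategy();
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
    try
    {
        string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
        // do stuff with pageContent
    }
    catch (Exception)
    {
        // handle extraction errors here
    }
}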
The output:
First loop = Page 1 = 1
Second loop = Page 1 = 1, Page 2 = 2
Third loop = Page 1 = 1, Page 2 = 2, Page 3 = 3
I moved the declaration of pageContent out of the loop and added this line before the try statement:
pageContent = "";
I stepped through, and pageContent is "" at the start of the second iteration. Yet after the call to GetTextFromPage, it contains the text of both the first and the second page.
This has occurred with a variety of PDFs, so I figure the problem is in my code and not in any particular PDF.
I spotted the issue, though I don't think this should be an issue...
PdfReader pdfReader = new PdfReader(filename);
PdfDocument pdfDoc = new PdfDocument(pdfReader);
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
    try
    {
        // A new strategy instance for every page
        var strategy = new SimpleTextExtractionStrategy();
        string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
        // do stuff with pageContent
    }
    catch (Exception)
    {
        // handle extraction errors here
    }
}
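The strategy has to be created inside the loop (here, inside the try block), so that every page gets a new instance. Once it is placed there, GetTextFromPage returns just the requested page and no longer appends it to the text of the earlier pages.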
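As far as I can tell, the reason is that the strategy object collects the text it receives in an internal buffer, and GetTextFromPage hands back whatever the strategy has collected so far. The sketch below is a toy model of that behaviour and does not use iText at all: AccumulatingStrategy, StrategyDemo, GetText, ExtractShared and ExtractFresh are hypothetical names made up for illustration, and the pages are plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a text extraction strategy. This is NOT iText code;
// it only mimics a listener that appends everything it is fed to one buffer.
public class AccumulatingStrategy
{
    private readonly StringBuilder _buffer = new StringBuilder();

    public void Feed(string pageText) => _buffer.Append(pageText);

    // Returns everything fed so far, not just the most recent page.
    public string GetResultantText() => _buffer.ToString();
}

public static class StrategyDemo
{
    // Hypothetical helper playing the role of PdfTextExtractor.GetTextFromPage:
    // feed the page to the strategy, then return the strategy's collected text.
    public static string GetText(string pageText, AccumulatingStrategy strategy)
    {
        strategy.Feed(pageText);
        return strategy.GetResultantText();
    }

    // One shared strategy: every result also contains the earlier pages.
    public static List<string> ExtractShared(IEnumerable<string> pages)
    {
        var shared = new AccumulatingStrategy();
        var results = new List<string>();
        foreach (var page in pages)
        {
            results.Add(GetText(page, shared));
        }
        return results;
    }

    // A new strategy per page: every result contains only that page.
    public static List<string> ExtractFresh(IEnumerable<string> pages)
    {
        var results = new List<string>();
        foreach (var page in pages)
        {
            results.Add(GetText(page, new AccumulatingStrategy()));
        }
        return results;
    }

    public static void Main()
    {
        var pages = new[] { "1", "2", "3" };
        Console.WriteLine(string.Join(" | ", ExtractShared(pages))); // 1 | 12 | 123
        Console.WriteLine(string.Join(" | ", ExtractFresh(pages)));  // 1 | 2 | 3
    }
}
```

Resetting pageContent to "" cannot help in this model, because the accumulated text lives in the strategy object and not in the string variable.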