Python PDF text extraction

Question

I am trying to extract text from a PDF file. Below is my code:

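import PyPDF2

file_path = "xxx.pdf"

# Open the PDF in binary mode and create a reader
pdfFileObj = open(file_path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Print the number of pages
print(pdfReader.numPages)

# Extract text from the first page
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

# Close the PDF file object
pdfFileObj.close()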

After running the code, I kept getting the error message:
---> print(pageObj.extractText())
AttributeError: 'NameObject' object has no attribute 'get_data'

The full output looks like this:

68
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
c:\HL\PDF parsing\pdfparsing pdfminer.ipynb Cell 6 in <cell line: 20>()
     17 pageObj = pdfReader.getPage(11) 
     19 # extracting text from page 
---> 20 print(pageObj.extractText()) 
     22 # closing the pdf file object 
     23 pdfFileObj.close()

File c:\Users\Hlin\Anaconda3\lib\site-packages\PyPDF2\_page.py:1545, in PageObject.extractText(self, Tj_sep, TJ_sep)
   1539 """
   1540 .. deprecated:: 1.28.0
   1541 
   1542     Use :meth:`extract_text` instead.
   1543 """
   1544 deprecate_with_replacement("extractText", "extract_text")
-> 1545 return self.extract_text()

File c:\Users\Hlin\Anaconda3\lib\site-packages\PyPDF2\_page.py:1517, in PageObject.extract_text(self, Tj_sep, TJ_sep, orientations, space_width, *args)
   1514 if isinstance(orientations, int):
   1515     orientations = (orientations,)
-> 1517 return self._extract_text(
   1518     self, self.pdf, orientations, space_width, PG.CONTENTS
   1519 )
...
   (...)
    205         .replace(b">>", b"\n}\n")  # some solution to find it back
    206     )

AttributeError: 'NameObject' object has no attribute 'get_data'

I couldn't find any reports of a similar error. The odd thing is that the same code works fine with other files, just not with this specific one.

Does anyone have any idea what could be wrong with my code or with the PDF file?

Answer 1

Try reinstalling the PyPDF2 library.

pip uninstall pypdf2

pip install pypdf2
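
Since the code works on other files, it may also help to find out exactly which pages of this PDF fail. Below is a minimal sketch, not a definitive fix. It assumes a PyPDF2 release that provides PdfReader, reader.pages and extract_text() (the traceback in the question already shows extractText being deprecated in favour of extract_text). The helper names extract_pages_safely and extract_pdf are made up for illustration.

```python
def extract_pages_safely(pages):
    """Try to extract text from every page.

    Returns two dicts keyed by zero-based page index:
    texts (pages that worked) and failures (repr of the exception raised).
    """
    texts, failures = {}, {}
    for index, page in enumerate(pages):
        try:
            texts[index] = page.extract_text()
        except Exception as exc:  # deliberately broad: we only want to locate bad pages
            failures[index] = repr(exc)
    return texts, failures


def extract_pdf(path):
    """Hypothetical wrapper: open a PDF and run the helper on all of its pages."""
    # Imported here so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        return extract_pages_safely(reader.pages)


if __name__ == "__main__":
    import os

    file_path = "xxx.pdf"  # placeholder path taken from the question
    if os.path.exists(file_path):
        texts, failures = extract_pdf(file_path)
        print(f"{len(texts)} pages extracted, {len(failures)} pages failed")
        for index, error in failures.items():
            print(f"page {index}: {error}")
```

If only a few pages show up in failures, the problem is more likely in how those pages are encoded in the file than in your code or your installation.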