Python PDF text extraction

I am trying to extract text from a PDF file. Below is my code:

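import PyPDF2

file_path = "xxx.pdf"

# open the PDF in binary mode and create a reader
pdfFileObj = open(file_path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# print the number of pages
print(pdfReader.numPages)

# get the first page
pageObj = pdfReader.getPage(0)

# extracting text from page
print(pageObj.extractText())

# closing the pdf file object
pdfFileObj.close()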

When I run the code, I keep getting this error message:
---> print(pageObj.extractText())
AttributeError: 'NameObject' object has no attribute 'get_data'

The full output looks like this:

68
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
c:\HL\PDF parsing\pdfparsing pdfminer.ipynb Cell 6 in <cell line: 20>()
     17 pageObj = pdfReader.getPage(11) 
     19 # extracting text from page 
---> 20 print(pageObj.extractText()) 
     22 # closing the pdf file object 
     23 pdfFileObj.close()

File c:\Users\Hlin\Anaconda3\lib\site-packages\PyPDF2\_page.py:1545, in PageObject.extractText(self, Tj_sep, TJ_sep)
   1539 """
   1540 .. deprecated:: 1.28.0
   1541 
   1542     Use :meth:`extract_text` instead.
   1543 """
   1544 deprecate_with_replacement("extractText", "extract_text")
-> 1545 return self.extract_text()

File c:\Users\Hlin\Anaconda3\lib\site-packages\PyPDF2\_page.py:1517, in PageObject.extract_text(self, Tj_sep, TJ_sep, orientations, space_width, *args)
   1514 if isinstance(orientations, int):
   1515     orientations = (orientations,)
-> 1517 return self._extract_text(
   1518     self, self.pdf, orientations, space_width, PG.CONTENTS
   1519 )
...
   (...)
    205         .replace(b">>", b"\n}\n")  # some solution to find it back
    206     )

AttributeError: 'NameObject' object has no attribute 'get_data'

I couldn't find any reports of a similar error. The strange thing is that the same code works fine with other files, and fails only with this specific one.

Does anyone have any idea what could be wrong with my code or with the PDF file?

Try reinstalling the PyPDF2 library.

pip uninstall pypdf2

pip install pypdf2
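
If reinstalling does not help, it may be worth checking whether the whole file or only some pages trigger the error, since the same code works on other PDFs. Below is a minimal sketch, not a confirmed fix. It assumes PyPDF2 1.28 or later, where PdfReader, reader.pages and page.extract_text() are the non-deprecated names. The helper names extract_pages_safely and report_failing_pages are made up for this example.

```python
def extract_pages_safely(pages, extract=lambda page: page.extract_text()):
    """Try to extract text from every page, collecting failures instead of stopping.

    `pages` is any iterable of page objects (for example `reader.pages`).
    Returns two dicts keyed by zero-based page index:
    the extracted texts, and an error description for each page that failed.
    """
    texts = {}
    failures = {}
    for index, page in enumerate(pages):
        try:
            texts[index] = extract(page)
        except Exception as exc:  # broad on purpose: we only want to locate bad pages
            failures[index] = f"{type(exc).__name__}: {exc}"
    return texts, failures


def report_failing_pages(file_path):
    """Hypothetical helper: print which pages of `file_path` cannot be extracted."""
    import PyPDF2  # imported here so the helper above works without PyPDF2 installed

    with open(file_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        texts, failures = extract_pages_safely(reader.pages)

    print(f"{len(texts)} page(s) extracted, {len(failures)} page(s) failed")
    for index, message in failures.items():
        print(f"page {index}: {message}")
    return texts, failures
```

Calling report_failing_pages("xxx.pdf") would then show whether the AttributeError comes from every page or only from a few. If only a few pages fail, that points to something unusual in how those pages are stored in the file rather than to the code itself.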
