
Extracting text from PDF using Python 3

I am trying to extract text from a PDF file using the slate module, which I installed like this:

$sudo pip install https://codeload.github.com/timClicks/slate/zip/master
Collecting https://codeload.github.com/timClicks/slate/zip/master
  Downloading https://codeload.github.com/timClicks/slate/zip/master
Requirement already satisfied: distribute in /usr/lib/python3.5/site-packages (from slate==0.5.2)
Requirement already satisfied: pdfminer3k in /usr/lib/python3.5/site-packages (from slate==0.5.2)
Requirement already satisfied: setuptools>=0.7 in /usr/lib/python3.5/site-packages (from distribute->slate==0.5.2)
Requirement already satisfied: pytest>=2.0 in /usr/lib/python3.5/site-packages (from pdfminer3k->slate==0.5.2)
Requirement already satisfied: ply>=3.4 in /usr/lib/python3.5/site-packages (from pdfminer3k->slate==0.5.2)
Requirement already satisfied: py>=1.4.29 in /usr/lib/python3.5/site-packages (from pytest>=2.0->pdfminer3k->slate==0.5.2)
Installing collected packages: slate
  Found existing installation: slate 0.3
    Uninstalling slate-0.3:
      Successfully uninstalled slate-0.3
  Running setup.py install for slate ... done
Successfully installed slate-0.5.2

and I am trying to run:

#!/usr/bin/python3
import slate

with open('/var/tmp/PhysRevB.93.014203.pdf') as fp:
    doc = slate.PDF(fp)
print(len(doc))
print(doc[0])

which gives me this error:

$python3 tstslt.py 
Traceback (most recent call last):
  File "tstslt.py", line 2, in <module>
    import slate
  File "/usr/lib/python3.5/site-packages/slate/__init__.py", line 66, in <module>
    from .classes import PDF
  File "/usr/lib/python3.5/site-packages/slate/classes.py", line 25, in <module>
    import utils
ImportError: No module named 'utils'

I can extract the text using PyPDF2, but I would like to find out whether slate does a better job.

According to this issue, one of slate's dependencies (pdfminer) doesn't support Python 3.

(...)

The "pdfminer" package that slate requires does not work because it is currently incompatible with Python 3.5.

It says so in their README:

https://github.com/euske/pdfminer

"Install Python 2.6 or newer. (Python 3 is not supported.)"

slate3k is a fork of the original slate for Python 3.

You can install it with pip install slate3k
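Since slate3k is imported under the same name (import slate) as the original package, it can help to check which distribution is actually installed before debugging further. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name installed_version is hypothetical and not part of slate or slate3k.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Both distributions provide a module called "slate", so look them up
# by their distribution (pip) names rather than by importing them.
for name in ("slate", "slate3k"):
    print(name, "->", installed_version(name))
```

If both names report a version, uninstalling the original slate before installing slate3k is probably the safer order, so that the two do not overwrite each other's files.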

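After you have installed slate3k, you also have to set the mode used to open the file. A PDF must be opened in binary mode ('rb'):

#!/usr/bin/python3
import slate

# PDFs are binary files, so use 'rb' instead of the default text mode
with open('/var/tmp/PhysRevB.93.014203.pdf', 'rb') as fp:
    doc = slate.PDF(fp)
print(len(doc))  # number of pages
print(doc[0])    # text of the first page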
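To illustrate why the 'rb' mode matters, here is a small sketch that does not need slate at all. It writes a few PDF-like bytes to a temporary file (a made-up stand-in, not a valid PDF) and reads them back in both modes. The encoding is pinned to UTF-8 so the result does not depend on the system locale.

```python
import os
import tempfile

# Hypothetical stand-in for a real PDF: the usual header followed by a
# comment line of high bytes, which is not valid UTF-8.
pdf_like = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(pdf_like)
    path = tmp.name

try:
    # Text mode: Python tries to decode the bytes as text and fails.
    try:
        with open(path, encoding="utf-8") as fp:
            fp.read()
        text_mode_ok = True
    except UnicodeDecodeError:
        text_mode_ok = False

    # Binary mode: the bytes are returned untouched, which is what a
    # PDF parser expects.
    with open(path, "rb") as fp:
        header = fp.read(5)
finally:
    os.remove(path)

print("text mode readable:", text_mode_ok)
print("first bytes in binary mode:", header)
```

With a real PDF the exact failure in text mode may differ (a decoding error, or a parser complaining about str instead of bytes), but opening with 'rb' avoids both.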

Just run pip install utils after you have installed slate with pip install https://github.com/timClicks/slate/archive/master.zip
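For background, the traceback in the question fails on a bare import utils inside slate/classes.py. In Python 2 such a statement could pick up a sibling module from the same package; Python 3 only searches top-level modules unless the import is written as from . import utils. The sketch below reproduces that general mechanism with a throwaway package. All names in it (try_import, demo_pkg_py2_style, demo_pkg_py3_style, _demo_utils) are hypothetical and chosen to avoid clashing with anything installed; it is not slate's actual code.

```python
import importlib
import os
import sys
import tempfile


def try_import(pkg_name, import_line):
    """Create a temporary package whose classes.py contains import_line and
    report whether importing pkg_name.classes succeeds."""
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = os.path.join(root, pkg_name)
        os.mkdir(pkg_dir)
        files = (
            ("__init__.py", ""),
            ("_demo_utils.py", "VALUE = 1\n"),  # sibling helper module
            ("classes.py", import_line + "\n"),
        )
        for filename, body in files:
            with open(os.path.join(pkg_dir, filename), "w") as fh:
                fh.write(body)

        sys.path.insert(0, root)
        importlib.invalidate_caches()
        try:
            importlib.import_module(pkg_name + ".classes")
            return True
        except ImportError:
            return False
        finally:
            sys.path.remove(root)


# Python 2 style: a bare import of a sibling module.
py2_style_ok = try_import("demo_pkg_py2_style", "import _demo_utils")
# Python 3 style: an explicit relative import.
py3_style_ok = try_import("demo_pkg_py3_style", "from . import _demo_utils")

print("bare import works:", py2_style_ok)
print("explicit relative import works:", py3_style_ok)
```

On Python 3 the bare import fails and the explicit relative import succeeds, which matches the ImportError shown in the question.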
