To convert a PDF to text, I am using the following command:
pdf2txt.py -o text.txt example.pdf # It will convert example.pdf to text.txt
But I have more than 1000 PDF files that I need to convert to text before doing the analysis.
Is there a way to use this command to iterate over the PDF files and convert all of them?
I would suggest using a shell script:
for f in *.pdf; do pdf2txt.py -o "${f%.pdf}.txt" "$f"; done
Then read all the .txt files using Python for your analysis.
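As a rough sketch of that second step, the converted files could be loaded into a dictionary keyed by file name. The helper names `load_texts` and `word_counts` are made up for illustration, and UTF-8 is an assumption about how the text files were written:

```python
from pathlib import Path


def load_texts(directory="."):
    """Return a dict mapping each .txt file name to its contents."""
    texts = {}
    for txt_path in sorted(Path(directory).glob("*.txt")):
        # UTF-8 is an assumption; errors="replace" keeps the loop going
        # if a file contains bytes that do not decode cleanly.
        texts[txt_path.name] = txt_path.read_text(encoding="utf-8", errors="replace")
    return texts


def word_counts(texts):
    """Count whitespace-separated words per file, as a stand-in analysis."""
    return {name: len(body.split()) for name, body in texts.items()}
```

Calling `word_counts(load_texts("."))` in the folder with the converted files gives one count per file; swap in whatever analysis you actually need.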
Using only Python to convert:
from subprocess import call
import glob
for pdf_file in glob.glob('*.pdf'):
    # -o names the output file; the input PDF comes last
    call(["pdf2txt.py", "-o", pdf_file[:-3] + "txt", pdf_file])
The Python code failed on my Windows 10 machine (OSError: [WinError 193] %1 is not a valid Win32 application). The for loop should be:
for pdf_file in glob.glob('*.pdf'):
    call(['python.exe', 'pdf2txt.py', '-o', pdf_file[:-3] + 'txt', pdf_file])
Be careful with the argument order: -o takes the output file, and the input PDF comes last. If you swap them, your PDF files will be overwritten by empty files.
Still, thanks to Gurupad Hegde for showing me how to convert the files; it helped a lot!
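For anyone who wants something a bit more defensive, here is a minimal sketch that combines the points above. It assumes pdf2txt.py sits in the current directory (pass a different `script` path otherwise); `build_command` and `convert_all` are hypothetical helper names, not part of pdfminer. It uses `sys.executable` instead of a hard-coded `python.exe`, derives the output name with `pathlib`, and skips PDFs that already have a .txt next to them:

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, script="pdf2txt.py"):
    """Return (command, output_path) for converting one PDF.

    `script` is assumed to be a path to pdf2txt.py that the interpreter
    can open; adjust it if the script lives elsewhere.
    """
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # -o names the OUTPUT file; the input PDF comes last.
    command = [sys.executable, script, "-o", str(txt_path), str(pdf_path)]
    return command, txt_path


def convert_all(directory=".", script="pdf2txt.py"):
    """Convert every PDF in `directory`; return names of PDFs that failed."""
    failed = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        command, txt_path = build_command(pdf_path, script)
        if txt_path.exists():
            continue  # already converted on an earlier run
        result = subprocess.run(command)
        if result.returncode != 0:
            failed.append(pdf_path.name)
    return failed
```

Running the script through the current interpreter is what avoids WinError 193 on Windows, and returning the list of failures makes it easy to retry only the problem files out of 1000+.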