Alternative to PyPDF2
I am extracting the text from a .pdf file using the PyPDF2 package. I get output, but not in the form I want, and I can't find where the problem is.
The code snippet is as follows:
import PyPDF2

def Read(startPage, endPage):
    global text
    text = []
    cleanText = " "
    pdfFileObj = open('F:\\Pen Drive 8 GB\\PDF\\Handbooks\\book1.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    print(num_pages)
    while (startPage <= endPage):
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
    print(text)

Read(3, 3)
The output I currently get is attached for reference:
Any help is highly appreciated.
This line:
    cleanText += myWord
just concatenates all of the words into one long string. If you want to filter out '\n', then instead of:
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
you can do this:
text = [w for w in text if w != '\n']
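One thing that may be worth checking: in the question, text starts out as a list and a string is added to it with +=, so it probably ends up holding single characters rather than words. The sketch below shows that behaviour, and one possible alternative that keeps each page as a string and lets split() handle the newlines. The pages_to_words helper and the sample strings are made up for illustration; they stand in for whatever extractText() returns and are not part of PyPDF2.

```python
# Adding a str to a list with += extends the list one character at a time,
# which is why the loop in the question iterates over characters, not words.
chars = []
chars += "ab\ncd"
print(chars)  # ['a', 'b', '\n', 'c', 'd']


def pages_to_words(page_texts):
    """Join per-page strings and split them into words on any whitespace."""
    # Joining with "\n" keeps a separator between the last word of one page
    # and the first word of the next.
    full_text = "\n".join(page_texts)
    # split() with no arguments splits on runs of spaces, tabs and newlines,
    # so no separate newline filtering is needed.
    return full_text.split()


# Hypothetical sample standing in for the text extracted from two pages
sample_pages = ["Hello\nworld", "second page\n"]
words = pages_to_words(sample_pages)
print(words)  # ['Hello', 'world', 'second', 'page']
```

In the original function this would mean collecting each pageObj.extractText() result in a list of strings and splitting once at the end. This only helps if the extracted text contains whitespace between words in the first place.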