Extracting a list of words from a PDF in Python

Question

I am trying to extract the words of a PDF as a list.

I can extract the text from the PDF, but I cannot get it into a list:

import PyPDF2
import pandas as pd

PDFfilename = '1200.pdf'

pdfFileObj = open(PDFfilename, 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Note: the range starts at 1, so the first page (index 0) is skipped.
for i in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(i)
    print('\n\n')
    txt = pageObj.extractText()
    print(txt)
pdfFileObj.close()

Expected result: [Alabama, Building, ..]
Actual result: Alabama Building

Answer 1

If your result looks like this: Alabama Building ... then split the string on whitespace:

txt = txt.split()
print(txt)

Answer 2

You can use the split() method for this, like so:

txt=pageObj.extractText().split()
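
To end up with one list for the whole document rather than one per page, the per-page results can be accumulated. Below is a minimal sketch, assuming PyPDF2 2.0 or later, where `PdfReader`, `reader.pages` and `extract_text()` take the place of the older `PdfFileReader`, `getPage` and `extractText` used in the question. The helper names `words_from_pages` and `extract_words` are made up for this example and are not part of PyPDF2.

```python
def words_from_pages(page_texts):
    """Flatten an iterable of page strings into one list of words."""
    words = []
    for text in page_texts:
        # str.split() with no arguments splits on any run of whitespace
        # (spaces, tabs, newlines) and never yields empty strings.
        # "text or ''" guards against a page with no text at all.
        words.extend((text or "").split())
    return words


def extract_words(pdf_path):
    """Return all whitespace-separated words in the PDF at pdf_path."""
    # Imported here so words_from_pages stays usable without PyPDF2.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # The generator is consumed inside the "with" block,
        # while the file is still open.
        return words_from_pages(page.extract_text() for page in reader.pages)


if __name__ == "__main__":
    print(words_from_pages(["Alabama Building", "  Second\npage "]))
```

Calling `extract_words('1200.pdf')` should then give a list along the lines of the expected result in the question, provided the PDF has an extractable text layer. Punctuation stays attached to the words, since the split is on whitespace only.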

Answer 3

If you want to do more with the text, you can also tokenize it. For that, I recommend spaCy.

First, install it and download spaCy's "small" English model:

pip install spacy
python -m spacy download en_core_web_sm

Then add these three lines to your code:

import spacy # with other imports
nlp = spacy.load("en_core_web_sm") # early in your script to load the model
doc = nlp(txt) # before your print(txt) line

doc will be iterable. For example, you will be able to analyze each word using part-of-speech tags:

for token in doc:
  print(token, token.pos_)

Output:

Alabama PROPN # 'PROPN' means proper noun
Building NOUN
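
If only the words themselves are wanted from the spaCy `doc`, punctuation and whitespace tokens can be filtered out using the token attributes `is_punct` and `is_space`. This is a rough sketch rather than the only way to do it; `words_from_doc` and `tokenize_words` are hypothetical helper names, and the model name is the one installed above.

```python
def words_from_doc(doc):
    """Return the text of each token that is neither punctuation nor whitespace."""
    return [token.text for token in doc
            if not (token.is_punct or token.is_space)]


def tokenize_words(text, model="en_core_web_sm"):
    """Tokenize text with spaCy and return its words as a list of strings."""
    # Imported lazily; requires spaCy and the named model to be installed.
    import spacy

    nlp = spacy.load(model)
    return words_from_doc(nlp(text))
```

In the question's loop, `tokenize_words(txt)` could replace `txt.split()`, though loading the model once outside the loop and calling `words_from_doc(nlp(txt))` would avoid reloading it for every page.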

Have fun :)
