How do I count occurrences of a list of words in text extracted from a PDF using Python?

Question

I am trying to count occurrences of a list of words in text extracted from a PDF, but every count comes out as 0, which is incorrect.

total_number_of_keywords = 0
pdf_file = "CapitalCorp.pdf"
tables=[]

words = ['blank','warrant ','offering','combination ','SPAC','founders']
count={} # is a dictionary data structure in Python


with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i,pg in enumerate(pages):
        tbl = pages[i].extract_tables()
        for elem in words:
            count[elem] = 0
        for line in f'{i} --- {tbl}' :
            elements = line.split()
            for word in words:
                count[word] = count[word]+elements.count(word)
print (count)

Answer 1

This will do the job:

import pdfplumber

pdf_file = "CapitalCorp.pdf"
words = ['blank','warrant ','offering','combination ','SPAC','founders']

# Get the text of every page as one string
text = ''
with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        # "or ''" guards against extract_text() returning None for an empty page
        text = text + '\n' + (page.extract_text() or '')

# Set up the count dictionary
count = {}
for elem in words:
    count[elem] = 0

# Count occurrences
for el in words:
    count[el] = text.count(el)

print(count)

First, the contents of the PDF are stored in the variable text, which is a string.
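
Working on the whole string is what makes the difference. The loop in the question iterates over an f-string, and iterating over a string yields one character at a time, so line.split() never contains a full word and every count stays 0. (The question's code also resets count for each page, so only the last page would be counted anyway.) Here is a minimal sketch of that behaviour; the sample string is a made-up stand-in for f'{i} --- {tbl}', so no PDF is needed:

```python
# Hypothetical stand-in for f'{i} --- {tbl}' from the question.
sample = "0 --- [['blank check offering']]"
words = ['blank', 'offering']

# Iterating over a string yields single characters, not lines.
per_char = {w: 0 for w in words}
for line in sample:
    elements = line.split()  # a one-character token, or [] for a space
    for w in words:
        per_char[w] += elements.count(w)

# Counting against the whole string finds the words.
whole = {w: sample.count(w) for w in words}

print(per_char)  # {'blank': 0, 'offering': 0}
print(whole)     # {'blank': 1, 'offering': 1}
```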

Then you set up the count dictionary, with one key for each element of words and each value initialized to 0.

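Finally, you use the count() method to count how many times each element of words occurs in text, and store the result under the corresponding key of the count dictionary.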
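
One thing to keep in mind: str.count() counts case-sensitive substrings, so 'blank' also matches inside 'blanket', 'SPAC' does not match 'Spac', and an entry with a trailing space such as 'warrant ' is missed when the word is followed by punctuation or a line break. If you want whole-word, case-insensitive counts instead, something like the following may work. It is only a sketch: count_whole_words is a hypothetical helper name and the sample text is made up, not taken from the PDF.

```python
import re

def count_whole_words(text, words):
    """Count case-insensitive whole-word matches of each word in text."""
    counts = {}
    for word in words:
        term = word.strip()  # drop stray spaces such as in 'warrant '
        pattern = r'\b' + re.escape(term) + r'\b'
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Made-up sample standing in for the extracted PDF text.
sample = ("A blank check company issues blanket terms.\n"
          "Each warrant, and the Spac founders.")
words = ['blank', 'warrant ', 'SPAC', 'founders']

print({w: sample.count(w) for w in words})
# {'blank': 2, 'warrant ': 0, 'SPAC': 0, 'founders': 1}
print(count_whole_words(sample, words))
# {'blank': 1, 'warrant': 1, 'SPAC': 1, 'founders': 1}
```

Whether substring or whole-word counting is right depends on what you want to measure; the \b word boundary is a simple choice and will not handle hyphenated or line-wrapped words specially.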

Answer 1: accepted, 2021-10-08 09:05:04
