How to count the words from a list in text extracted from a PDF using Python?

Question

I am trying to count a series of words extracted from a PDF, but I only get 0, which is not correct.

total_number_of_keywords = 0
pdf_file = "CapitalCorp.pdf"
tables=[]

words = ['blank','warrant ','offering','combination ','SPAC','founders']
count={} # is a dictionary data structure in Python


with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i,pg in enumerate(pages):
        tbl = pages[i].extract_tables()
        for elem in words:
            count[elem] = 0
        for line in f'{i} --- {tbl}' :
            elements = line.split()
            for word in words:
                count[word] = count[word]+elements.count(word)
print (count)

Answer 1

This will do the job:

import pdfplumber
pdf_file = "CapitalCorp.pdf"
words = ['blank','warrant ','offering','combination ','SPAC','founders']

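# Extract the text of every page into one string
text = ''
with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        # Older pdfplumber versions may return None for a page without text
        text += '\n' + (page.extract_text() or '')

# Set up the count dictionary with every keyword at 0
count = {}
for elem in words:
    count[elem] = 0

# Count occurrences of each keyword (substring matches)
for word in words:
    count[word] = text.count(word)

print(count)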

First, you store the content of the PDF in the variable text, which is a string.

Then, you set up the count dictionary, with one key for every element of words and each value set to 0.

Last, you count the occurrences of every element of words in text with the count() method and store the result under the respective key of your count dictionary.
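
Note that str.count() counts substring matches: 'SPAC' is also found inside 'SPACs', and an entry with a trailing space such as 'warrant ' only matches when a space actually follows it in the text. If whole-word counts are what you want, a minimal sketch could look like the one below. The helper name count_whole_words and the sample string are illustrative assumptions, not part of the original answer, and the sketch strips the stray spaces from the keywords before matching.

```python
import re

def count_whole_words(text, words):
    """Count whole-word, case-sensitive occurrences of each keyword in text."""
    counts = {}
    for word in words:
        key = word.strip()  # drop stray spaces such as in 'warrant '
        # \b anchors the match at word boundaries, so 'SPAC' does not match 'SPACs'
        pattern = r'\b' + re.escape(key) + r'\b'
        counts[key] = len(re.findall(pattern, text))
    return counts

# Hypothetical sample text standing in for the extracted PDF content
sample = "The SPAC issued a warrant. SPACs and warrants differ; the offering closed."
result = count_whole_words(sample, ['warrant ', 'offering', 'SPAC'])
print(result)
```

With the text variable from the answer, the call would be count_whole_words(text, words). Whether substring or whole-word counting is the right choice depends on whether plurals and other derived forms should be included.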

Answer 1
1 accepted 2021-10-08 09:05:04
