
Python - Count how many times keywords stored in a list appear in text

I have a list KeywordList of 20k+ keywords. I want to check how many keywords in KeywordList appear in multiple, separate text files. I would also like to know the overall frequency of the keywords that appear in the text files. What is the best way to do this?

I would use the bag-of-words approach; see https://en.wikipedia.org/wiki/Bag-of-words_model

Here is an example I wrote a few years back that extracts word counts from a pandas DataFrame:

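# Split each row on whitespace, flatten into a single Series of words, then count each unique word.
all_words = df['keywords'].str.split(expand=True).unstack().value_counts()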

This gives you key-value pairs of unique words and their counts. Iterate over your files, accumulating the counts, and you should have all of the words with their total counts.
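The iteration over files could be sketched as follows. This version swaps pandas for collections.Counter from the standard library; the function names, the regex tokenization, and the lowercasing are assumptions for illustration and are not part of the original example.

```python
import re
from collections import Counter
from pathlib import Path


def count_words(text):
    """Count lowercase word tokens in one text (tokenization is an assumption)."""
    return Counter(re.findall(r"\w+", text.lower()))


def count_words_in_files(paths, encoding="utf-8"):
    """Accumulate word counts over several text files into one Counter."""
    total = Counter()
    for path in paths:
        total.update(count_words(Path(path).read_text(encoding=encoding)))
    return total
```

Calling count_words_in_files on a list of file paths returns one Counter for all files together. Keeping one Counter per file instead would let you report the counts file by file.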

From there you can convert the words you counted and KeywordList to sets and use the intersection function. This gives you the set of all keywords in KeywordList that appear in the text. Looking those keywords up in the word counts then gives you their frequencies.
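A minimal sketch of the intersection step, assuming the word counts are already available as a mapping from word to count. The helper name keyword_stats and the sample data are hypothetical.

```python
from collections import Counter


def keyword_stats(word_counts, keyword_list):
    """Return (keywords found, per-keyword counts, total occurrences)."""
    # Set intersection: keywords that also occur among the counted words.
    found = set(keyword_list) & set(word_counts)
    per_keyword = {kw: word_counts[kw] for kw in found}
    return found, per_keyword, sum(per_keyword.values())


# Hypothetical data standing in for real word counts and KeywordList.
word_counts = Counter({"python": 3, "list": 2, "the": 10})
KeywordList = ["python", "list", "pandas"]

found, per_keyword, total = keyword_stats(word_counts, KeywordList)
print(len(found), total)  # 2 5
```

This only matches single-word keywords, and the keywords must be normalized (for example lowercased) the same way as the counted words. Multi-word keywords would need a different matching step.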
