
Count number of different rows in which each word appears

I have a Pandas DataFrame (or a Series, given that I'm just using one column) that contains strings. I also have a list of words. For each word in this list, I want to check how many different rows it appears in at least once. For example:

import pandas as pd

words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
                "bye bye bye bye",
                "bar foo hi bar",
                "hi bye foo bar"])

In this case, the output should be:

0   hi      3
1   bye     3
2   foo     3
3   bar     2

This is because "hi" appears in three different rows (1st, 3rd and 4th), "bar" appears in two (3rd and 4th), and so on.

I came up with the following way to do this:

word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances.update({word: appearances})

pd.DataFrame(word_appearances.items())

This works fine, but the problem is that I have a rather long list of words (around 40,000), around 30,000 rows to check, and strings that are not as short as the ones in the example. When I try this approach on my real data, it takes forever to run. Is there a more efficient way to do this?

Try a list comprehension with str.contains and sum:

df_out = pd.DataFrame([[word, sum(df.str.contains(word))] for word in words], 
                       columns=['word', 'word_count'])

Out[58]:
  word  word_count
0   hi           3
1  bye           3
2  foo           3
3  bar           2
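
One caveat with this approach, shown as a minimal sketch: str.contains treats the word as a regular expression by default and matches substrings, so "hi" would also match a row containing "this". If whole-word matches are wanted, wrapping the escaped word in \b word boundaries is one option. The `rows` Series and its extra fifth row are assumptions added here purely for illustration; they are not part of the original data.

```python
import re
import pandas as pd

# Example data from the question, plus one hypothetical extra row
# (added for illustration) in which "hi" only occurs inside "this".
rows = pd.Series(["hi hi hi bye foo",
                  "bye bye bye bye",
                  "bar foo hi bar",
                  "hi bye foo bar",
                  "this is not a match"])

# Plain substring/regex match: the extra row is counted too.
substring_hits = rows.str.contains('hi').sum()

# Whole-word match: escape the word and surround it with word boundaries.
pattern = r'\b' + re.escape('hi') + r'\b'
whole_word_hits = rows.str.contains(pattern).sum()

print(substring_hits, whole_word_hits)
```

Whether substring or whole-word matching is the right behaviour depends on the real data, so this is a point to check rather than a required change.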

word_appearances = {}
for word in words:
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances[word] = appearances

pd.DataFrame.from_dict(word_appearances, columns=['Frequency'], orient='index')
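
Both approaches above scan every row once per word, which is likely where the time goes with roughly 40,000 words and 30,000 rows. A possible alternative, offered only as a sketch, is to split each row once into its set of distinct tokens and tally them with collections.Counter. This assumes the words are separated by whitespace and that whole-token matches are what is wanted; the names `row_counts` and `row_count` are made up for this example.

```python
from collections import Counter
import pandas as pd

words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
                "bye bye bye bye",
                "bar foo hi bar",
                "hi bye foo bar"])

# Tally, for every token, the number of rows it appears in.
# set(...) ensures a token is counted at most once per row.
row_counts = Counter()
for text in df:
    row_counts.update(set(text.split()))

# Look up only the words of interest; a Counter returns 0 for missing keys.
df_out = pd.DataFrame({'word': words,
                       'row_count': [row_counts[w] for w in words]})
print(df_out)
```

Each row is processed only once regardless of how many words are in the list, so the cost grows with the total amount of text rather than with rows times words. If the real strings contain punctuation or mixed case, the tokenisation step would need adjusting.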
