Count number of different rows in which each word appears
I have a Pandas DataFrame (or a Series, given that I'm just using one column) that contains strings. I also have a list of words. For each word in this list, I want to check how many different rows it appears in at least once. For example:
import pandas as pd

words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
                "bye bye bye bye",
                "bar foo hi bar",
                "hi bye foo bar"])
In this case, the output should be:
0 hi 3
1 bye 3
2 foo 3
3 bar 2
Because "hi" appears in three different rows (1st, 3rd and 4th), "bar" appears in two (3rd and 4th), and so on.
I came up with the following way to do this:
word_appearances = {}
for word in words:
    # Count matches per row, cap each row at 1, then add up the rows
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances.update({word: appearances})
pd.DataFrame(word_appearances.items())
This works fine, but the problem is that I have a rather long list of words (around 40,000), around 30,000 rows to check, and strings that are not as short as the ones I used in the example. When I try my approach with my real data, it takes forever to run. Is there a more efficient way to do this?
Try a list comprehension with str.contains and sum:
df_out = pd.DataFrame([[word, sum(df.str.contains(word))] for word in words],
                      columns=['word', 'word_count'])
Out[58]:
word word_count
0 hi 3
1 bye 3
2 foo 3
3 bar 2
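With 40,000 words and 30,000 rows, any per-word scan still passes over the whole Series once for every word. A different technique, not from the answer above, is to go through the rows once, turn each row into a set of tokens, and tally with a Counter. This is only a sketch and it assumes the words are separated by whitespace and that whole-word matches are what you want; unlike str.contains, it will not count "hi" inside "this". The names wanted and row_counts are made up for the illustration.

```python
from collections import Counter

import pandas as pd

words = ['hi', 'bye', 'foo', 'bar']
df = pd.Series(["hi hi hi bye foo",
                "bye bye bye bye",
                "bar foo hi bar",
                "hi bye foo bar"])

wanted = set(words)      # hypothetical helper name: fast membership test
row_counts = Counter()   # hypothetical helper name: word -> number of rows

for text in df:
    # set(...) collapses repeats within a row, so each row adds
    # at most 1 to the tally of any given word.
    row_counts.update(set(text.split()) & wanted)

df_out = pd.DataFrame({'word': words,
                       'word_count': [row_counts[w] for w in words]})
print(df_out)
```

On the example data this gives the same counts as above (3, 3, 3, 2). If your real strings contain punctuation or mixed case, the split step would need adjusting first.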
word_appearances = {}
for word in words:
    # Count matches per row, cap each row at 1, then add up the rows
    appearances = df.str.count(word).clip(upper=1).sum()
    word_appearances[word] = appearances
pd.DataFrame.from_dict(word_appearances, columns=['Frequency'], orient='index')
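One thing to watch with both str.count and str.contains: by default pandas treats the word as a regular expression, so a word containing characters like "." or "+" may match more than intended or raise an error. A small sketch of the difference, using made-up sample rows that are not from the question:

```python
import re

import pandas as pd

# Hypothetical sample data, chosen to contain regex metacharacters.
rows = pd.Series(["a.b c++", "axb c", "c++ c++"])

# As a regex, "a.b" also matches "axb" because "." means "any character".
as_regex = int(rows.str.contains("a.b").sum())                 # 2

# regex=False makes str.contains do a literal substring test.
as_literal = int(rows.str.contains("a.b", regex=False).sum())  # 1

# str.count has no regex=False switch, so escape the word instead.
cpp_rows = int(rows.str.count(re.escape("c++")).clip(upper=1).sum())  # 2

print(as_regex, as_literal, cpp_rows)
```

If your word list is plain alphabetic words, this makes no difference to the results.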