Python: Count of occurrences in dict from another list
I am trying to count the number of times a word occurs in a dict column, based on a subset of words I am interested in.
First I import my data:
products = graphlab.SFrame('amazon_baby.gl/')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
products.head(5)
Data can be found here: https://drive.google.com/open?id=0BzbhZp-qIglxM3VSVWRsVFRhTWc
I then create a list of the words I am interested in:
words = ['awesome', 'great', 'fantastic']
I would like to count the number of times each word in "words" occurs in products['word_count'].
I am not married to using graphlab. It was just suggested to me by a colleague.
Well, I am not quite sure what you mean by 'in a dict column'. If it is a list:
import operator

dictionary = {'texts': ['red blue blue', 'red black', 'blue white white', 'red', 'white', 'black', 'blue red']}
words = ['red', 'white', 'blue']
freqs = dict()
for t in dictionary['texts']:
    for w in words:
        # str.count counts substring matches of w in t
        freqs[w] = freqs.get(w, 0) + t.count(w)
# Sort (word, count) pairs by count, highest first
top_words = sorted(freqs.items(), key=operator.itemgetter(1), reverse=True)
If it is just one text:
import operator

dictionary = {'text': 'red blue blue red black blue white white red white black blue red'}
words = ['red', 'white', 'blue']
freqs = dict()
for w in words:
    # str.count counts substring matches of w in the text
    freqs[w] = freqs.get(w, 0) + dictionary['text'].count(w)
# Sort (word, count) pairs by count, highest first
top_words = sorted(freqs.items(), key=operator.itemgetter(1), reverse=True)
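One caveat with the snippets above: str.count counts substring matches, so 'great' would also be counted inside 'greatest'. A minimal sketch of whole-word counting with a regular expression follows; the helper name count_whole_word and the sample text are made up for illustration.

```python
import re

text = "great greatest ungreat great"

# Substring counting: matches "great" inside longer words too
substring_count = text.count("great")

def count_whole_word(text, word):
    """Count occurrences of word delimited by word boundaries."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text))

whole_word_count = count_whole_word(text, "great")

print(substring_count, whole_word_count)
```

Which behavior is wanted depends on the data; for short color names the difference may never show up, but for review text it probably will.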
If you want to count occurrences of words, a fast way to do it is to use the Counter object from collections.
For example:
In [3]: from collections import Counter
In [4]: c = Counter(['hello', 'world'])
In [5]: c
Out[5]: Counter({'hello': 1, 'world': 1})
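Applied to the question, a Counter can accumulate counts over all reviews and then be queried for just the words of interest. This is only a sketch: the sample reviews are invented, and the tokenization (lowercasing, splitting on whitespace, stripping a few punctuation marks) is an assumption that may not match what graphlab's count_words does.

```python
from collections import Counter

# Invented sample reviews standing in for products['review']
reviews = [
    "great product, really great",
    "awesome stroller",
    "not fantastic but great value",
]
words = ["awesome", "great", "fantastic"]

totals = Counter()
for review in reviews:
    # Naive tokenization: split on whitespace, strip simple punctuation
    tokens = [tok.strip(".,!?;:").lower() for tok in review.split()]
    totals.update(tokens)

# A Counter returns 0 for missing keys, so no KeyError handling is needed
counts = {w: totals[w] for w in words}
print(counts)
```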
Could you show the output of your products.head(5) command?
If you stick with graphlab (or SFrame), use the SArray.dict_trim_by_keys method. The documentation is here: https://dato.com/products/create/docs/generated/graphlab.SArray.dict_trim_by_keys.html
import graphlab as gl
sf = gl.SFrame({'review': ['what a good book', 'terrible book']})
sf['word_bag'] = gl.text_analytics.count_words(sf['review'])
keywords = ['good', 'book']
sf['key_words'] = sf['word_bag'].dict_trim_by_keys(keywords, exclude=False)
print sf
+------------------+---------------------+---------------------+
|      review      |       word_bag      |      key_words      |
+------------------+---------------------+---------------------+
| what a good book | {'a': 1, 'good':... | {'good': 1, 'boo... |
|  terrible book   | {'book': 1, 'ter... |     {'book': 1}     |
+------------------+---------------------+---------------------+
[2 rows x 3 columns]
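To go from the trimmed per-row dicts to one total per keyword, the dicts still need to be summed. The following plain-Python sketch mimics the trimming step with a dict comprehension and then adds up the counts; it does not use graphlab, and the word_bags list is a hand-written stand-in for the word_bag column shown above.

```python
# Hand-written stand-in for sf['word_bag'] from the example above
word_bags = [
    {"what": 1, "a": 1, "good": 1, "book": 1},
    {"terrible": 1, "book": 1},
]
keywords = ["good", "book"]

# Keep only the keys of interest in each row (plain-Python analogue of trimming)
trimmed = [{k: v for k, v in bag.items() if k in keywords} for bag in word_bags]

# Sum each keyword's count across all rows; missing keys count as 0
totals = {k: sum(bag.get(k, 0) for bag in trimmed) for k in keywords}
print(totals)
```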
Do you want to put each of the counts in a separate column? In that case this may work:
keywords = ['keyword1', 'keyword2']

def word_counter(dict_cell, word):
    # Return the stored count for word, or 0 if the word is absent
    if word in dict_cell:
        return dict_cell[word]
    else:
        return 0

for words in keywords:
    df[words] = df['word_count'].apply(lambda x: word_counter(x, words))
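A side note on the lambda inside the loop: Python closures look up the loop variable when they are called, not when they are defined. If apply runs immediately (as pandas does), the loop above is fine. If the library evaluates apply lazily, which is an assumption here and not something confirmed for SFrame, every column could end up using the last keyword. A small sketch of the effect and the usual default-argument fix, using made-up data:

```python
keywords = ["good", "book"]
cell = {"good": 3, "book": 1}  # made-up word-count dict for one row

late = []
bound = []
for word in keywords:
    # Late binding: 'word' is looked up when the lambda is called
    late.append(lambda x: x.get(word, 0))
    # Default argument: the current value of 'word' is captured now
    bound.append(lambda x, word=word: x.get(word, 0))

# Call the functions only after the loop has finished
late_results = [f(cell) for f in late]
bound_results = [f(cell) for f in bound]
print(late_results, bound_results)
```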
def count_words(x, w):
    # x is the review text; count substring occurrences of w in it
    if w in x:
        return x.count(w)
    else:
        return 0

selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
for words in selected_words:
    products[words] = products['review'].apply(lambda x: count_words(x, words))