Is it possible to add two words together while counting word frequencies? (Python)
import string
from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd
top_N = 100
words = review_tip['user_tip'].dropna()
words = words.astype(str)
# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
words = words.str.replace('[{}]'.format(string.punctuation), '', regex=True)
words = words.str.lower().apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))
# replace '|'-->' ' and drop all stopwords
words = words.str.lower().replace([r'\|', RE_stopwords], [' ', ''], regex=True).str.cat(sep=' ').split()
# generate DF out of Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(rslt)
plt.clf()
# plot
rslt.plot.bar(rot=90, figsize=(16,10), width=0.8)
plt.show()
         Frequency
Word
great        17069
food         16381
good         12502
service      11342
place        10841
best          9280
get           7483
love          7042
amazing       5043
try           4945
time          4810
go            4594
dont          4377
As you can see, the words are counted individually, which is something I can use, but is it possible to count pairs of words that were used together a lot?
For example, getting
dont go 100 (the pair might occur 100 times)
instead of getting the words separately:
dont 100
go 100
This will generate bi-grams. Is this what you are looking for?
bi_grams = zip(words, words[1:])
It generates tuples, which are fine to use in the Counter, but you could also easily tweak the code to use ' '.join((a, b)).
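A minimal sketch of how the zip(words, words[1:]) suggestion might be fed into Counter. The short words list here is a made-up stand-in for the cleaned token list built in the question, and the name bigram_counts is only illustrative:

```python
from collections import Counter

# Hypothetical stand-in for the cleaned token list from the question.
words = "dont go there dont go back great food great food great service".split()

# Pair each token with the one that follows it; zip stops at the shorter input,
# so the last token never starts a pair.
bi_grams = zip(words, words[1:])

# Join each (a, b) tuple into a single string so the result looks like
# the single-word frequency table.
bigram_counts = Counter(' '.join(pair) for pair in bi_grams)

top_bigrams = bigram_counts.most_common(3)
print(top_bigrams)
```

The list returned by most_common can be passed to pd.DataFrame the same way as in the question. One thing to keep in mind: because words is a single flat list built from all tips, some pairs will join the last word of one tip with the first word of the next.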