
Is it possible to add two words together while counting the word frequencies? (Python)

import re
import string
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd

# review_tip (a DataFrame), stopwords and RE_stopwords are not shown in the question
top_N = 100

words = review_tip['user_tip'].dropna()
words = words.astype(str)
# strip punctuation: the pattern is a regex character class, so escape it and pass regex=True
words = words.str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)
words = words.str.lower().apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))

# replace '|'-->' ' and drop all stopwords
words = words.str.lower().replace([r'\|', RE_stopwords], [' ', ''], regex=True).str.cat(sep=' ').split()

# generate DF out of Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(rslt)
plt.clf()
# plot
rslt.plot.bar(rot=90, figsize=(16,10), width=0.8)
plt.show()

            Frequency
Word                 
great           17069
food            16381
good            12502
service         11342
place           10841
best             9280
get              7483
love             7042
amazing          5043
try              4945
time             4810
go               4594
dont             4377

As you can see, the counts are for single words, which is something I can use, but is it possible to count pairs of words that were used together a lot?

For example, getting

dont go (this could be 100 times)

instead of getting them separately:

dont 100

go 100

This will generate bigrams. Is this what you are looking for?

bi_grams = zip(words, words[1:])

It generates tuples, which are fine to use in the Counter, but you could also easily tweak the code to use ' '.join((a, b)) so that each pair becomes a single string.
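As a rough sketch of how the pairs could then be counted, the snippet below joins each adjacent pair into one string and feeds the result to a Counter. The sample `words` list and the helper name `top_bigrams` are made up for illustration. In the question's code, `words` would be the token list produced by the final `split()`.

```python
from collections import Counter

# Hypothetical token list standing in for the `words` list built in the question.
words = ["dont", "go", "great", "food", "dont", "go", "great", "service"]


def top_bigrams(tokens, n):
    """Return the n most common adjacent word pairs as (phrase, count) tuples."""
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    pairs = zip(tokens, tokens[1:])
    # Join each pair into a single string so it reads like a two-word phrase.
    counts = Counter(" ".join(pair) for pair in pairs)
    return counts.most_common(n)


print(top_bigrams(words, 3))
```

The returned list has the same shape as `Counter(words).most_common(top_N)`, so it should drop into the existing `pd.DataFrame(..., columns=['Word', 'Frequency'])` call unchanged. One caveat: because the question concatenates all reviews into one list before splitting, some pairs will straddle the boundary between two reviews. If that matters, the pairs would need to be built per review instead.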
