
Grouping two words as one in FreqDist

My problem is that I have an Excel file with tweet data. I am doing text analysis by plotting the frequency distribution of words. The second and fourth most frequent words are 'pakistan' and 'pak', which mean essentially the same thing, so I want them grouped and counted as one word. Here is the code:

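import nltk
import pandas as pd
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

db = pd.read_excel(r'hello world.xlsx')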
db['Sentence'] = db['Sentence'].astype(str).str.lower() #convert all text to lower case

regexp = RegexpTokenizer(r'\w+')
db['Sentence_token']=db['Sentence'].apply(regexp.tokenize)

stopwords = nltk.corpus.stopwords.words("english")
my_stopwords = []
stopwords.extend(my_stopwords)

db['Sentence_token'] = db['Sentence_token'].apply(lambda x: [item for item in x if item not in stopwords])   
db['Sentence_string'] = db['Sentence_token'].apply(lambda x: ' '.join([item for item in x if len(item)>0])) 

all_words = ' '.join(db['Sentence_string'])

tokenized_words = nltk.tokenize.word_tokenize(all_words)
fdist = FreqDist(tokenized_words)

db['Sentence_string_fdist'] = db['Sentence_token'].apply(lambda x: ' '.join([item for item in x if fdist[item] >= 2])) #drop words which occur fewer than 2 times
db[['Sentence', 'Sentence_token', 'Sentence_string', 'Sentence_string_fdist']]

fdist

Output:

FreqDist({'xxx': 870, 'pakistan': 466, 'xxx': 268, 'pak': 253, 'xxx': 253, 'xxx': 251, 'xxx': 237, ...})

FreqDist is a subclass of collections.Counter, which in turn is a subclass of dict. So we can use the dict.pop method to get a value and remove its key at the same time. Say we want to remove 'pak' and add its count to that of 'pakistan'. We can do that like this:

fdist['pakistan'] += fdist.pop('pak', 0)
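If more than one pair of words needs merging, the same pop-and-add idea can be wrapped in a small helper. The following is only a sketch: `merge_aliases` and the `ALIASES` table are hypothetical names introduced here, and a plain `collections.Counter` stands in for the `FreqDist` so the snippet runs without NLTK or the Excel file. Because `FreqDist` subclasses `Counter`, the helper should work the same way on `fdist` from the question.

```python
from collections import Counter

# Hypothetical alias table: each variant spelling maps to its canonical token.
ALIASES = {"pak": "pakistan"}


def merge_aliases(counts, aliases):
    """Fold each alias count into its canonical key, in place, and return counts."""
    for alias, canonical in aliases.items():
        if alias in counts:
            # pop() returns the alias count and removes the alias key in one step.
            counts[canonical] += counts.pop(alias)
    return counts


# 'pakistan' and 'pak' counts come from the question's output; 'india' is made up.
fdist = Counter({"pakistan": 466, "pak": 253, "india": 100})
merge_aliases(fdist, ALIASES)
print(fdist.most_common(2))  # [('pakistan', 719), ('india', 100)]
```

Note that this changes only the counts. Columns built earlier from the tokens, such as `db['Sentence_token']`, still contain 'pak', so if those matter it may be simpler to replace the token before building the FreqDist.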
