Grouping two words as one in FreqDist

Question

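I have an Excel file containing tweet data, and I am doing text analysis by plotting the frequency distribution of its words. The second and fourth most common words are "pakistan" and "pak", which mean essentially the same thing, so I would like to treat them as one word and group their counts together. Here is the code: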

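import nltk
import pandas as pd
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

db = pd.read_excel(r'hello world.xlsx')
db['Sentence'] = db['Sentence'].astype(str).str.lower()  # convert all text to lower case

regexp = RegexpTokenizer(r'\w+')  # raw string, so \w reaches the regex engine unchanged
db['Sentence_token'] = db['Sentence'].apply(regexp.tokenize)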

stopwords = nltk.corpus.stopwords.words("english")
my_stopwords = []
stopwords.extend(my_stopwords)

db['Sentence_token'] = db['Sentence_token'].apply(lambda x: [item for item in x if item not in stopwords])   
db['Sentence_string'] = db['Sentence_token'].apply(lambda x: ' '.join([item for item in x if len(item)>0])) 

all_words = ' '.join(db['Sentence_string'])  # join every cleaned sentence into one string

tokenized_words = nltk.tokenize.word_tokenize(all_words)
fdist = FreqDist(tokenized_words)

db['Sentence_string_fdist'] = db['Sentence_token'].apply(lambda x: ' '.join([item for item in x if fdist[item] >= 2]))  # drop words that occur fewer than 2 times
db[['Sentence', 'Sentence_token', 'Sentence_string', 'Sentence_string_fdist']]

fdist

Output:

FreqDist({'xxx': 870, 'pakistan': 466, 'xxx': 268, 'pak': 253, 'xxx': 253, 'xxx': 251, 'xxx': 237, ...})

Answer 1

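FreqDist is a subclass of collections.Counter, which in turn is a subclass of dict, so we can use dict.pop to fetch a value and remove its key in one step. Suppose we want to remove 'pak' and increase the frequency of 'pakistan' by the same amount. With the fdist from the question, that can be done like this:

fdist['pakistan'] += fdist.pop('pak', 0)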
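
If more than one variant needs folding, the same pop-and-add step can be wrapped in a small helper. The following is only a sketch under stated assumptions: the ALIASES table and the names merge_aliases and normalize are made up for illustration, and a plain collections.Counter stands in for FreqDist (which subclasses it) so the example runs without NLTK. It also shows an alternative that rewrites the tokens before counting, which may be preferable when later steps (such as the per-row filtering in the question) should see a single spelling.

```python
from collections import Counter

# Hypothetical alias table: each variant maps to the spelling that keeps the count.
ALIASES = {"pak": "pakistan"}


def merge_aliases(freq, aliases):
    """Fold each alias count into its canonical key, modifying freq in place."""
    for alias, canonical in aliases.items():
        count = freq.pop(alias, 0)  # 0 when the alias never occurred
        if count:
            freq[canonical] += count  # a missing key counts as 0 in a Counter
    return freq


def normalize(tokens, aliases):
    """Alternative: rewrite tokens before counting them."""
    return [aliases.get(tok, tok) for tok in tokens]


# Toy token list, not data from the question.
tokens = ["pakistan", "pak", "cricket", "pak", "pakistan", "team"]

merged = merge_aliases(Counter(tokens), ALIASES)   # merge after counting
renamed = Counter(normalize(tokens, ALIASES))      # merge before counting

print(merged["pakistan"], "pak" in merged)
print(merged == renamed)
```

Both routes end with the same counts; merging after counting is the smaller change to existing code, while normalizing the tokens first keeps every derived column consistent.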

Answer 1 posted 2022-08-13 12:16:43