
Grouping two words as one in FreqDist

My problem is that I have an Excel file with tweet data. I am doing text analysis by plotting a frequency distribution of words. The second and the fourth most frequent words are 'pakistan' and 'pak', which mean essentially the same thing, so I want them to be grouped and counted as one. Here is the code:

import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist

db = pd.read_excel(r'hello world.xlsx')
db['Sentence'] = db['Sentence'].astype(str).str.lower()  # convert all text to lower case

regexp = RegexpTokenizer(r'\w+')  # raw string so \w is not treated as an escape
db['Sentence_token'] = db['Sentence'].apply(regexp.tokenize)

stopwords = nltk.corpus.stopwords.words("english")
my_stopwords = []  # extra stopwords can be added here
stopwords.extend(my_stopwords)

db['Sentence_token'] = db['Sentence_token'].apply(lambda x: [item for item in x if item not in stopwords])
db['Sentence_string'] = db['Sentence_token'].apply(lambda x: ' '.join([item for item in x if len(item) > 0]))

all_words = ' '.join(db['Sentence_string'])  # join all rows into one string

tokenized_words = nltk.tokenize.word_tokenize(all_words)
fdist = FreqDist(tokenized_words)

db['Sentence_string_fdist'] = db['Sentence_token'].apply(lambda x: ' '.join([item for item in x if fdist[item] >= 2]))  # drop words that occur fewer than 2 times
db[['Sentence', 'Sentence_token', 'Sentence_string', 'Sentence_string_fdist']]

fdist

Output:

FreqDist({'xxx': 870, 'pakistan': 466, 'xxx': 268, 'pak': 253, 'xxx': 253, 'xxx': 251, 'xxx': 237, ...})
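
As an aside, FreqDist inherits Counter's most_common, so the ranking can be inspected directly instead of printing the whole distribution:

# top 10 (word, count) pairs, sorted by frequency
print(fdist.most_common(10))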

FreqDist is a subclass of collections.Counter, which in turn is a dict, so we can use the dict.pop method to get a value and remove its key in one step. Say we want to remove 'pak' and add its count onto 'pakistan'; using the fdist from the question:

fdist['pakistan'] += fdist.pop('pak', 0)
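
If you would rather merge the variants before counting, so the grouped form also flows into Sentence_string_fdist, you can normalize tokens with an alias map right after tokenizing. A minimal sketch against the DataFrame above; the aliases dict is a hypothetical example and can hold as many variant-to-canonical pairs as you need:

# hypothetical alias map: each variant points to its canonical form
aliases = {'pak': 'pakistan'}

# rewrite every token through the map, leaving unknown tokens unchanged
db['Sentence_token'] = db['Sentence_token'].apply(
    lambda tokens: [aliases.get(t, t) for t in tokens])

# rebuild the distribution from the normalized tokens
fdist = FreqDist(t for tokens in db['Sentence_token'] for t in tokens)

Doing it at this stage means every downstream column and count sees only 'pakistan', instead of patching the distribution after the fact.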
