I have the following script to count the words in a column of my dataset:
df['open_answers']=df['open_answers'].apply(lambda x: ' '.join([item for item in x.split() if item not in stoplist]))
sequence_of_sentences=df['open_answers']
from collections import Counter
counts=Counter()
for sentence in sequence_of_sentences:
    counts.update(word.strip('.,?!"\"').lower() for word in sentence.split())
df1=(df['open_answers'].str.split(expand=True)
     .stack()
     .value_counts()
     .rename_axis('word')
     .reset_index(name='frequency'))
With this script I get a table with all the words that occur in the open answers and how often each one occurs. But I also want to look for combinations of words. For example, 'different' and 'employees' are currently counted separately, instead of in combination.
Does anyone know how I can change the script above to get all the combinations of 2 words and their frequencies?
You can use nltk.everygrams to get the word combinations, and then nltk.FreqDist to get the frequency of each combination.
Example:
import pandas as pd
from nltk import everygrams, word_tokenize, FreqDist
# Note: word_tokenize needs NLTK's tokenizer data. If it raises a LookupError,
# download the resource named in the error message with nltk.download(...).
df = pd.DataFrame({'open_answers' : ['With this script I get a table with all the words that occur to look instead of I know',
                                     'With this But I also want to look this for combinations script of words this I know',
                                     'For example I know that the combination of different and employees seperately instead of in combination']})
Use everygrams(word_tokenize(x), 2, 2) to get the two-word combinations. If you want both one-word and two-word combinations, use everygrams(word_tokenize(x), 1, 2), where 1 is the minimum and 2 is the maximum number of words per combination. You can then store the combinations in a column:
df['combinations'] = df['open_answers'].apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x), 2, 2)])
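To make the min and max arguments more concrete, here is a rough standard-library sketch of the same idea. The helper name ngrams_between is made up for illustration and is not part of NLTK; it also uses a plain str.split() instead of word_tokenize, and the order in which it returns combinations may differ from what everygrams yields.

```python
def ngrams_between(tokens, min_len, max_len):
    """Return every run of min_len..max_len consecutive tokens as a tuple.

    Hypothetical helper for illustration only; not NLTK's implementation.
    """
    result = []
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the token list
        for start in range(len(tokens) - n + 1):
            result.append(tuple(tokens[start:start + n]))
    return result


tokens = "I know that".split()

# min = max = 2: only two-word combinations
pairs_only = ngrams_between(tokens, 2, 2)

# min = 1, max = 2: single words plus two-word combinations
singles_and_pairs = ngrams_between(tokens, 1, 2)

print(pairs_only)
print(singles_and_pairs)
```

With min and max both set to 2, only adjacent word pairs come back, which is what the question asks for.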
Then use nltk.FreqDist to get the frequency distribution and store the result in a dataframe:
df1 = pd.DataFrame(list(FreqDist(y for x in df['combinations'] for y in x).items()), columns=['combination', 'freq'])
Result:
    combination         freq
0   With this           2
1   this script         1
2   script I            1
3   I get               1
4   get a               1
5   a table             1
6   table with          1
7   with all            1
8   all the             1
9   the words           1
10  words that          1
11  that occur          1
12  occur to            1
13  to look             2
14  look instead        1
15  instead of          2
16  of I                1
17  I know              3
18  this But            1
19  But I               1
...
...
41  seperately instead  1
42  of in               1
43  in combination      1
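If you would rather not depend on NLTK for this step, the Counter loop from the question can probably be adapted to count adjacent word pairs instead of single words. The sketch below is one way to do that, assuming whitespace splitting is good enough for your data; the two sample sentences are made up for illustration.

```python
from collections import Counter

# Made-up sample sentences; in the question these would come from df['open_answers']
sentences = [
    "With this script I get a table",
    "With this I know",
]

pair_counts = Counter()
for sentence in sentences:
    words = sentence.split()
    # zip(words, words[1:]) pairs each word with the word that follows it
    pair_counts.update(' '.join(pair) for pair in zip(words, words[1:]))

# Most frequent pairs first
print(pair_counts.most_common(3))
```

Pairs are only formed within a sentence, so the last word of one answer is never combined with the first word of the next.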
Edit 1
To eliminate punctuation characters, use string.punctuation and delete them with translate():
import string
df['combinations'] = df['open_answers']\
    .apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x.translate(str.maketrans('', '', string.punctuation))), 2, 2)])
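To see what the translate() step does on its own, here is a small standalone sketch on a made-up sentence. It uses str.split() rather than word_tokenize so that it runs without NLTK.

```python
import string

# Translation table that deletes every ASCII punctuation character
remove_punct = str.maketrans('', '', string.punctuation)

# Made-up example sentence
text = 'For example, I know that "different employees" occurs often!'
cleaned = text.translate(remove_punct)

# Build two-word combinations from the cleaned text
words = cleaned.split()
pairs = [' '.join(p) for p in zip(words, words[1:])]

print(cleaned)
print(pairs[:3])
```

Note that string.punctuation also contains the apostrophe, so a word like "don't" becomes "dont" after this step.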