All combinations of 2 words in dataset

Question

I have the following script to count the words in a column of my dataset:

df['open_answers']=df['open_answers'].apply(lambda x: ' '.join([item for item in x.split() if item not in stoplist]))
sequence_of_sentences=df['open_answers']
from collections import Counter
counts=Counter()
for sentence in sequence_of_sentences:
   counts.update(word.strip('.,?!"\"').lower() for word in sentence.split())

df1=(df['open_answers'].str.split(expand=True)
        .stack()
        .value_counts()
        .rename_axis('word')
        .reset_index(name='frequency'))

With this script I get a table of all the words that occur in the open answers and how often each one occurs. But I also want to look for combinations of words. For example, I know that 'different' and 'employees' often occur together, but the script counts them separately instead of in combination.

Does anyone know how I can change the script above in order to get all the combinations of 2 words and frequencies?

Answer 1

You can use nltk.everygrams to get the word combinations, and then nltk.FreqDist to get the frequency of each combination.

Example:

import pandas as pd
from nltk import everygrams, word_tokenize, FreqDist

df = pd.DataFrame({'open_answers' : ['With this script I get a table with all the words that occur to look instead of I know',
                                     'With this But I also want to look this for combinations script of words this I know',
                                     'For example I know that the combination of different and employees seperately instead of in combination']})

Use everygrams(word_tokenize(x), 2, 2) to get the two-word combinations. If you want both one-word and two-word combinations, use everygrams(word_tokenize(x), 1, 2), where 1 is the minimum and 2 the maximum number of words per combination. You can then store the combinations in a column:

df['combinations'] = df['open_answers'].apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x), 2, 2)])
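To make the minimum and maximum arguments concrete, here is a minimal pure-Python sketch of the same idea. It is not NLTK's implementation: the helper name word_ngrams is made up for illustration, it assumes whitespace tokenization rather than word_tokenize, and the order of its output is not claimed to match what everygrams yields.

```python
def word_ngrams(tokens, min_len, max_len):
    """Return all runs of min_len..max_len adjacent tokens, joined by spaces."""
    result = []
    for n in range(min_len, max_len + 1):
        # Slide a window of n tokens across the list.
        for start in range(len(tokens) - n + 1):
            result.append(' '.join(tokens[start:start + n]))
    return result

tokens = 'I also want'.split()  # assumption: whitespace tokenization
only_pairs = word_ngrams(tokens, 2, 2)         # two-word combinations only
singles_and_pairs = word_ngrams(tokens, 1, 2)  # one-word and two-word combinations
print(only_pairs)
print(singles_and_pairs)
```

With min and max both set to 2, only adjacent pairs come back; widening the range to 1..2 adds the single words as well.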

And then use nltk.FreqDist to get the frequency distribution and store the result in a dataframe.

df1 = pd.DataFrame(list(FreqDist([y for x in df['combinations'] for y in x]).items()), columns=['combination', 'freq'])

Result:

             combination  freq
0              With this     2
1            this script     1
2               script I     1
3                  I get     1
4                  get a     1
5                a table     1
6             table with     1
7               with all     1
8                all the     1
9              the words     1
10            words that     1
11            that occur     1
12              occur to     1
13               to look     2
14          look instead     1
15            instead of     2
16                  of I     1
17                I know     3
18              this But     1
19                 But I     1
...
...
41    seperately instead     1
42                 of in     1
43        in combination     1
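If NLTK is not available, the same adjacent-pair counting can be approximated with the standard library. This is only a sketch under stated assumptions, not the method used above: it tokenizes with str.split() instead of word_tokenize (so punctuation stays attached to words), and the helper name count_bigrams is hypothetical.

```python
from collections import Counter

def count_bigrams(sentences):
    """Count adjacent two-word combinations across an iterable of strings."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
        counts.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

answers = [
    'With this script I get a table',
    'With this But I also want to look',
]
bigram_counts = count_bigrams(answers)
print(bigram_counts.most_common(3))
```

Because each sentence is processed on its own, pairs never span two answers. The resulting Counter can be turned into a table with pd.DataFrame(bigram_counts.items(), columns=['combination', 'freq']) if a dataframe is needed.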

Edit 1

To eliminate punctuation, delete every character listed in string.punctuation with translate() before tokenizing:

import string

df['combinations'] = df['open_answers']\
    .apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x.translate(str.maketrans('', '', string.punctuation))), 2, 2)])
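The effect of the translate() step is easier to see on a single string. A small standalone sketch follows; the sample sentence and the name strip_punct are made up for illustration.

```python
import string

# Translation table that deletes every character in string.punctuation.
strip_punct = str.maketrans('', '', string.punctuation)

text = 'For example, I know that; right?'
cleaned = text.translate(strip_punct)
print(cleaned)

# Apostrophes are punctuation too, so contractions lose them.
contraction = "don't".translate(strip_punct)
print(contraction)
```

Note that this removes apostrophes and hyphens as well, so "don't" becomes "dont". If such tokens matter for the analysis, a narrower set of characters could be passed to str.maketrans instead.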

solution1
0 ACCPTED 2020-06-26 19:07:22