
All combinations of 2 words in dataset

I have the following script to count the words in a column of my dataset:

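from collections import Counter

# Remove stop words from each answer (stoplist is assumed to be defined elsewhere)
df['open_answers'] = df['open_answers'].apply(lambda x: ' '.join([item for item in x.split() if item not in stoplist]))
sequence_of_sentences = df['open_answers']
counts = Counter()
for sentence in sequence_of_sentences:
    # Strip surrounding punctuation and lowercase each word before counting
    counts.update(word.strip('.,?!"').lower() for word in sentence.split())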


With this script I get a table of all the words that occur in the open answers and how often each one occurs. But I also want to look at combinations of words. For example, the script currently counts 'different' and 'employees' separately, instead of in combination.

Does anyone know how I can change the script above to get all the combinations of 2 words and their frequencies?

You can use nltk.everygrams to get the word combinations, and then use nltk.FreqDist to get the frequency of each combination.


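import pandas as pd
# word_tokenize needs NLTK's Punkt tokenizer data, which is downloaded separately via nltk.download()
from nltk import everygrams, word_tokenize, FreqDist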

df = pd.DataFrame({'open_answers' : ['With this script I get a table with all the words that occur to look instead of I know',
                                     'With this But I also want to look this for combinations script of words this I know',
                                     'For example I know that the combination of different and employees seperately instead of in combination']})

Use everygrams(word_tokenize(x), 2, 2) to get the two-word combinations (runs of two consecutive words). If you want both one-word and two-word combinations, use everygrams(word_tokenize(x), 1, 2), where 1 is the minimum and 2 is the maximum number of words per combination. You can then store the combinations in a column:

df['combinations'] = df['open_answers'].apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x), 2, 2)])
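
To make the min and max arguments concrete, here is a minimal pure-Python sketch of the same sliding-window idea. Note that word_windows is a hypothetical helper written for illustration, not part of nltk, and it assumes plain whitespace splitting rather than word_tokenize; the order of its results may also differ from what everygrams returns.

```python
def word_windows(tokens, min_len, max_len):
    """Return every run of min_len..max_len consecutive tokens, joined by spaces.

    Hypothetical helper for illustration only; not nltk's implementation.
    """
    windows = []
    for size in range(min_len, max_len + 1):
        # Slide a window of `size` tokens across the list.
        for start in range(len(tokens) - size + 1):
            windows.append(' '.join(tokens[start:start + size]))
    return windows


tokens = 'I also want to look'.split()
pairs = word_windows(tokens, 2, 2)              # two-word combinations only
singles_and_pairs = word_windows(tokens, 1, 2)  # single words first, then pairs
```

With min and max both set to 2, only adjacent word pairs come back; lowering min to 1 adds the individual words as well.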

And then use nltk.FreqDist to get the frequency distribution and store the result in a dataframe.

# Flatten the per-row lists of combinations and count each one
df1 = pd.DataFrame(list(FreqDist([y for x in df['combinations'] for y in x]).items()), columns=['combination', 'freq'])


             combination  freq
0              With this     2
1            this script     1
2               script I     1
3                  I get     1
4                  get a     1
5                a table     1
6             table with     1
7               with all     1
8                all the     1
9              the words     1
10            words that     1
11            that occur     1
12              occur to     1
13               to look     2
14          look instead     1
15            instead of     2
16                  of I     1
17                I know     3
18              this But     1
19                 But I     1
41    seperately instead     1
42                 of in     1
43        in combination     1
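
If you would rather stay close to the Counter-based script from the question and avoid nltk, adjacent word pairs can also be counted with the standard library alone. This is only a sketch under a few assumptions: the sample answers are made up, tokens come from a plain split(), and punctuation is stripped with the same strip() call the question uses.

```python
from collections import Counter

# Sample answers, made up for illustration.
answers = [
    'Different employees need different tools.',
    'Different employees, different needs!',
]

pair_counts = Counter()
for sentence in answers:
    # Normalize as in the single-word script: strip punctuation, lowercase.
    words = [w.strip('.,?!"').lower() for w in sentence.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    # zip(words, words[1:]) pairs each word with the word that follows it.
    pair_counts.update(' '.join(pair) for pair in zip(words, words[1:]))

top_pairs = pair_counts.most_common(3)
```

Because pair_counts is an ordinary Counter, most_common() gives the pairs sorted by frequency, just like the single-word counts.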

Edit 1

To eliminate punctuation, delete every character in string.punctuation with translate() before tokenizing:

import string

df['combinations'] = df['open_answers']\
    .apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x.translate(str.maketrans('', '', string.punctuation))), 2, 2)])
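
As a small standalone illustration of the translate() step on its own, the sketch below removes punctuation from a single made-up sentence. The variable names are placeholders chosen for this example.

```python
import string

# Translation table that deletes every character in string.punctuation.
remove_punct = str.maketrans('', '', string.punctuation)

raw = 'For example, I know that "different employees" matter!'
cleaned = raw.translate(remove_punct)
```

Keep in mind that string.punctuation also contains the apostrophe and the hyphen, so a word such as "don't" becomes "dont" after this step.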

