All combinations of 2 words in dataset


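from collections import Counter

# Drop stop words from each answer (stoplist is assumed to be defined earlier)
df['open_answers'] = df['open_answers'].apply(lambda x: ' '.join([item for item in x.split() if item not in stoplist]))
sequence_of_sentences = df['open_answers']
counts = Counter()
for sentence in sequence_of_sentences:
    # Strip surrounding punctuation and lowercase each word before counting
    counts.update(word.strip('.,?!"\"').lower() for word in sentence.split())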


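With this script I get a table of all the words that occur in the open answers and how often each one occurs. But I also want to look for combinations of words. For example, right now I only get counts for "different" and "employees" separately, not for the two words in combination.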

Does anyone know how I can change the script above to get all combinations of 2 words and their frequencies?

You can use nltk.everygrams to get the word combinations, then use nltk.FreqDist to get the frequency of each combination.


import pandas as pd
from nltk import everygrams, word_tokenize, FreqDist

df = pd.DataFrame({'open_answers' : ['With this script I get a table with all the words that occur to look instead of I know',
                                     'With this But I also want to look this for combinations script of words this I know',
                                     'For example I know that the combination of different and employees seperately instead of in combination']})

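Use everygrams(word_tokenize(x), 2, 2) to get the two-word combinations. If you want both single words and two-word combinations, use everygrams(word_tokenize(x), 1, 2), where 1 is the minimum and 2 is the maximum number of words per combination. The line below stores the combinations for each answer in a new column.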
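To make the minimum/maximum arguments concrete, here is a plain-Python sketch of the kind of output to expect. `ngrams_between` is a hypothetical helper written for this illustration, not part of nltk, and it splits on whitespace rather than using word_tokenize; the order in which everygrams yields its results may also differ.

```python
def ngrams_between(tokens, min_len, max_len):
    """Yield every run of consecutive tokens whose length is in [min_len, max_len]."""
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])

tokens = "I also want".split()

# Only two-word combinations (min_len == max_len == 2)
bigrams_only = [' '.join(g) for g in ngrams_between(tokens, 2, 2)]

# Single words plus two-word combinations (min_len 1, max_len 2)
uni_and_bi = [' '.join(g) for g in ngrams_between(tokens, 1, 2)]

print(bigrams_only)
print(uni_and_bi)
```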

df['combinations'] = df['open_answers'].apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x), 2, 2)])

Then use nltk.FreqDist to get the frequency distribution and store the result in a dataframe.

# Flatten the per-answer lists and count each two-word combination
df1 = pd.DataFrame(list(FreqDist(y for x in df['combinations'] for y in x).items()), columns=['combination', 'freq'])


             combination  freq
0              With this     2
1            this script     1
2               script I     1
3                  I get     1
4                  get a     1
5                a table     1
6             table with     1
7               with all     1
8                all the     1
9              the words     1
10            words that     1
11            that occur     1
12              occur to     1
13               to look     2
14          look instead     1
15            instead of     2
16                  of I     1
17                I know     3
18              this But     1
19                 But I     1
41    seperately instead     1
42                 of in     1
43        in combination     1
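If you would rather stay close to the Counter loop from the question, a rough standard-library alternative is to pair each word with its successor using zip. This is only a sketch under a few assumptions: it splits on whitespace instead of using word_tokenize, and `sentences` and `pair_counts` are names made up for the example.

```python
from collections import Counter

sentences = [
    'With this script I get a table with all the words that occur to look instead of I know',
    'With this But I also want to look this for combinations script of words this I know',
    'For example I know that the combination of different and employees seperately instead of in combination',
]

pair_counts = Counter()
for sentence in sentences:
    words = sentence.split()
    # zip(words, words[1:]) pairs every word with the word that follows it
    pair_counts.update(' '.join(pair) for pair in zip(words, words[1:]))

# Show the most frequent two-word combinations
for pair, freq in pair_counts.most_common(5):
    print(pair, freq)
```

Because the pairs are built per sentence, no combination spans the boundary between two answers.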

Edit 1


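import string

# Remove punctuation from each answer before tokenizing it into two-word combinations
df['combinations'] = df['open_answers']\
    .apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x.translate(str.maketrans('', '', string.punctuation))), 2, 2)])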
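The punctuation-stripping step on its own can be tried out like this. It is a small sketch with a made-up sample string; `strip_punct` is just a name chosen for the example.

```python
import string

# Translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans('', '', string.punctuation)

text = 'Hello, world! How are you?'
cleaned = text.translate(strip_punct)

print(cleaned.split())
```

Note that string.punctuation includes the apostrophe, so a word like "don't" becomes "dont" with this approach.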


