All combinations of 2 words in a dataset

Question

I have the following script to count the words in a column of my dataset:

from collections import Counter

# Remove stop words (stoplist is my own stop-word list, not shown here)
df['open_answers'] = df['open_answers'].apply(lambda x: ' '.join([item for item in x.split() if item not in stoplist]))
sequence_of_sentences = df['open_answers']
counts = Counter()
for sentence in sequence_of_sentences:
    # Strip surrounding punctuation and lowercase each word before counting
    counts.update(word.strip('.,?!"').lower() for word in sentence.split())

df1=(df['open_answers'].str.split(expand=True)
        .stack()
        .value_counts()
        .rename_axis('word')
        .reset_index(name='frequency'))

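With this script I get a table of all the words that occur in the open answers and how often each one occurs. But I also want to look for combinations of words. For example, "different" and "employees" are currently counted separately instead of as a combination.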

Does anyone know how I can change the script above to get all combinations of 2 words and their frequencies?

Answer 1

You can use nltk.everygrams to get the word combinations, then use nltk.FreqDist to get the frequency of each combination.

Example:

import pandas as pd
from nltk import everygrams, word_tokenize, FreqDist

df = pd.DataFrame({'open_answers' : ['With this script I get a table with all the words that occur to look instead of I know',
                                     'With this But I also want to look this for combinations script of words this I know',
                                     'For example I know that the combination of different and employees seperately instead of in combination']})

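Use everygrams(word_tokenize(x), 2, 2) to get the two-word combinations. If you want both single words and two-word combinations, use everygrams(word_tokenize(x), 1, 2), where 1 is the minimum and 2 is the maximum number of words per combination. The combinations can then be stored in a column: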

df['combinations'] = df['open_answers'].apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x), 2, 2)])
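
To make the two length arguments concrete, here is a minimal pure-Python sketch of what the minimum and maximum bounds select. Note that ngrams_between is a hypothetical helper written only for illustration; it is not NLTK's implementation, and the order of its output may differ from what everygrams yields.

```python
def ngrams_between(tokens, min_len, max_len):
    """Return every run of min_len..max_len consecutive tokens as a tuple.

    Hypothetical helper for illustration only, not part of NLTK.
    """
    grams = []
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the token list
        for start in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[start:start + n]))
    return grams


tokens = "I also want combinations".split()
pairs_only = ngrams_between(tokens, 2, 2)         # two-word combinations only
singles_and_pairs = ngrams_between(tokens, 1, 2)  # single words plus pairs
```

With four tokens, the (2, 2) call gives three pairs, while (1, 2) gives the four single words followed by the same three pairs.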

Then use nltk.FreqDist to get the frequency distribution and store the result in a dataframe.

df1 = pd.DataFrame(list(FreqDist(y for x in df['combinations'] for y in x).items()), columns=['combination', 'freq'])

Result:

             combination  freq
0              With this     2
1            this script     1
2               script I     1
3                  I get     1
4                  get a     1
5                a table     1
6             table with     1
7               with all     1
8                all the     1
9              the words     1
10            words that     1
11            that occur     1
12              occur to     1
13               to look     2
14          look instead     1
15            instead of     2
16                  of I     1
17                I know     3
18              this But     1
19                 But I     1
...
...
41    seperately instead     1
42                 of in     1
43        in combination     1
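
If you would rather stay close to the Counter-based script from the question, the same pair counts can probably be obtained without NLTK. This is a minimal sketch under some assumptions: words are split on whitespace only (no tokenizer), and the two sample sentences are shortened versions of the example data above.

```python
from collections import Counter

# Shortened sample sentences, for illustration only
sentences = [
    "With this script I get a table",
    "With this But I also want to look",
]

pair_counts = Counter()
for sentence in sentences:
    words = sentence.split()
    # zip(words, words[1:]) pairs each word with the word that follows it
    pair_counts.update(' '.join(pair) for pair in zip(words, words[1:]))

most_common = pair_counts.most_common(1)
```

Pairs are formed within each sentence, so the last word of one answer is never combined with the first word of the next. The counts can be turned into a table with pd.DataFrame(list(pair_counts.items()), columns=['combination', 'freq']).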

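Edit 1

Use string.punctuation together with translate() to strip the punctuation before tokenizing. The third argument of str.maketrans lists characters to delete, so punctuation is removed rather than replaced by spaces.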

import string

df['combinations'] = df['open_answers']\
    .apply(lambda x: [' '.join(word) for word in everygrams(word_tokenize(x.translate(str.maketrans('', '', string.punctuation))), 2, 2)])
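
To see what the translate() step does on its own, here is a small standalone sketch using only the standard library. The sample sentence is made up for illustration.

```python
import string

# Deletion table: the third argument lists the characters to remove
remove_punct = str.maketrans('', '', string.punctuation)

text = "For example, I know: different employees!"
cleaned = text.translate(remove_punct)
words = cleaned.split()
```

Because the characters are deleted and not replaced by spaces, a word such as "don't" becomes "dont" and a hyphenated word is joined into one token. If that is not what you want, a table that maps punctuation to spaces may be a better fit.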

Solution 1
0 Accepted 2020-06-26 19:07:22
