Append values to one DataFrame based on their frequency in another DataFrame

Question

I have two DataFrames. df1 is the result of a groupby, i.e. df.groupby('keyword'):

df1

keyword     string

   A        "This is a test string for the example" 
            "This is also a test string based on the other string"
            "This string is a test string based on the other strings"
   B        "You can probably guess that this is also a test string"
            "Yet again, another test string"
            "This is also a test"

and df2,

which is an empty DataFrame. I also have a list of specific values:

keyword_list = ['string', 'test']

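Basically, I want to count how often each word in keyword_list occurs in df1 and, for each keyword in df1, put the most frequent word into a specific column of the new DataFrame, so that 'A' in df2 is assigned the word that occurs most often in df1's string column.

Ideally, since 'string' is the word that occurs most often in df1's string column for keyword A, A would be assigned 'string', and so on.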

df2

keyword    High_freq_word

   A         "string"
   B         "test"

Let me know if you need clarification or if this doesn't make sense!

Update:

@anky_91 provided some great code, but the output is a bit awkward:

df['matches'] = df.description.str.findall('|'.join(keyword_list))

df.groupby(odf.Type.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))

This gives you:

DF1

keyword     string                                                     

   A        "This is a test string for the example" 
            "This is also a test string based on the other string"
            "This string is a test string based on the other strings"
   B        "You can probably guess that this is also a test string"
            "Yet again, another test string"
            "This is also a test"

but it adds a new column:

matches

['string','test']
['test', 'string', 'string']
[etc...]

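I can think of a way to convert this to numbers and then assign that value to a column, but the bigger problem is attaching this new column to the new DataFrame.

Since it is a groupby, there are several duplicate values. I am trying to find a Pythonic way to map the "most frequent word" to the keyword itself, rather than the whole mode based on the keyword list.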
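For illustration, the "numeric" conversion mentioned above could look roughly like the sketch below: explode the matches so that each hit gets its own row, then count hits per keyword. This is only a sketch. It assumes the column names keyword and string from the example tables, and the sample DataFrame is rebuilt by hand with None standing in for the blank keyword cells.

```python
import pandas as pd

keyword_list = ['string', 'test']

# Sample data rebuilt from the tables above (assumed layout);
# None marks the blank keyword cells that get forward-filled later.
df = pd.DataFrame({
    'keyword': ['A', None, None, 'B', None, None],
    'string': [
        "This is a test string for the example",
        "This is also a test string based on the other string",
        "This string is a test string based on the other strings",
        "You can probably guess that this is also a test string",
        "Yet again, another test string",
        "This is also a test",
    ],
})

# One list of matched words per row
matches = df['string'].str.findall('|'.join(keyword_list))

# One row per individual match, labelled with its forward-filled keyword
exploded = pd.DataFrame({
    'keyword': df['keyword'].ffill(),
    'word': matches,
}).explode('word')

# Count table: one row per keyword, one column per word
counts = pd.crosstab(exploded['keyword'], exploded['word'])
print(counts)
```

Note that the pattern matches substrings, so "strings" also counts as a hit for 'string'. With a count table like this, the most frequent word per keyword is simply the column holding the largest value in each row.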

Answer 1

As I understand it, you can do something like this:

from itertools import chain
from scipy.stats import mode

keyword_list = ['string', 'test']
df['matches']=df.string.str.findall('|'.join(keyword_list)) #find all matches
df.groupby(df.keyword.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))

keyword
A    string
B      test
Name: matches, dtype: object
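If scipy.stats.mode does not accept string input in your SciPy version (as far as I know, newer releases restrict it to numeric arrays), the same idea can be sketched with collections.Counter. This variant also turns the result into the df2 layout from the question. The helper name most_common_word and the hand-built sample data are assumptions for illustration, not part of the original code.

```python
from collections import Counter
from itertools import chain

import pandas as pd

keyword_list = ['string', 'test']

# Sample data rebuilt from the question (assumed layout);
# None marks the blank keyword cells.
df = pd.DataFrame({
    'keyword': ['A', None, None, 'B', None, None],
    'string': [
        "This is a test string for the example",
        "This is also a test string based on the other string",
        "This string is a test string based on the other strings",
        "You can probably guess that this is also a test string",
        "Yet again, another test string",
        "This is also a test",
    ],
})


def most_common_word(lists_of_matches):
    """Hypothetical helper: most frequent word across one group's match lists."""
    counts = Counter(chain.from_iterable(lists_of_matches))
    return counts.most_common(1)[0][0] if counts else None


# Find all matches per row, as in the code above
df['matches'] = df['string'].str.findall('|'.join(keyword_list))

# One row per keyword, with the winning word in its own column
df2 = (
    df.groupby(df['keyword'].ffill())['matches']
      .apply(most_common_word)
      .reset_index(name='High_freq_word')
)
print(df2)
```

One difference to be aware of: on a tie, Counter.most_common returns the word that was seen first, so tie-breaking may differ from the mode-based version.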

Solution 1
3 · Accepted · 2019-05-28 16:17:06
