
Append values to one dataframe based on their frequency in another dataframe

I have two dataframes. df1 is a groupby, i.e. df.groupby('keyword'):

df1

keyword     string

   A        "This is a test string for the example" 
            "This is also a test string based on the other string"
            "This string is a test string based on the other strings"
   B        "You can probably guess that this is also a test string"
            "Yet again, another test string"
            "This is also a test"

and df2, which is an empty dataframe. I also have a list of specific values:

keyword_list = ['string', 'test']

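Basically, I want to count how often each word in keyword_list occurs in df1 and, for each keyword in df1, append the most frequent word to a specific column in the new dataframe, so that 'A' in df2 is assigned the word that occurs most often in df1's string column.

Ideally, since 'string' is the value that occurs most often in the string column for keyword A in df1, A would be assigned 'string', and so on.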

df2

keyword    High_freq_word

   A         "string"
   B         "test"
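For reference, here is a minimal sketch of one way to get from df1 to the df2 shown above. It assumes df1 is a flat frame whose keyword cell is blank (None) on the continuation rows, and it counts substring occurrences, so 'strings' also counts as a hit for 'string'. The names counts and totals are only illustrative.

```python
import pandas as pd

# Sample data mirroring df1; keyword is only filled on the first row of each group.
df1 = pd.DataFrame({
    "keyword": ["A", None, None, "B", None, None],
    "string": [
        "This is a test string for the example",
        "This is also a test string based on the other string",
        "This string is a test string based on the other strings",
        "You can probably guess that this is also a test string",
        "Yet again, another test string",
        "This is also a test",
    ],
})
keyword_list = ["string", "test"]

# One column of per-row occurrence counts for each word in keyword_list.
counts = pd.DataFrame({w: df1["string"].str.count(w) for w in keyword_list})

# Sum the counts within each keyword group (blank keywords forward-filled),
# then pick the word whose column has the largest total.
totals = counts.groupby(df1["keyword"].ffill()).sum()
df2 = totals.idxmax(axis=1).rename("High_freq_word").reset_index()

print(df2)
```

Because the totals are kept in their own frame, the raw counts stay available if you need more than just the top word.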

Let me know if you need clarification or if this makes sense!

Update:

@anky_91 provided some awesome code, but the output is a bit awkward

df['matches'] = df.description.str.findall('|'.join(keyword_list))
df.groupby(odf.Type.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))

which gets you

df1

keyword     string

   A        "This is a test string for the example" 
            "This is also a test string based on the other string"
            "This string is a test string based on the other strings"
   B        "You can probably guess that this is also a test string"
            "Yet again, another test string"
            "This is also a test"

but it adds a new column:

matches

['string','test']
['test', 'string', 'string']
[etc...]

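I can figure out a way to convert it numerically and then assign that value to the column, but the bigger problem is appending this new column to the new dataframe.

Since it's a groupby, there are several duplicate values. I'm trying to find a pythonic way to map the "most frequent word" to the keyword itself, rather than the whole mode based on the keyword list.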
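One possible way to collapse the per-row matches lists into a single word per keyword is sketched below. It assumes the keyword column has already been forward-filled and that matches holds lists like the ones shown above; top_word is a hypothetical name, and ties are broken alphabetically because Series.mode() returns its results sorted.

```python
import pandas as pd

# Assumed input: one list of matched words per row, keyword already filled in.
df = pd.DataFrame({
    "keyword": ["A", "A", "A", "B", "B", "B"],
    "matches": [
        ["test", "string"],
        ["test", "string", "string"],
        ["string", "test", "string", "string"],
        ["test", "string"],
        ["test", "string"],
        ["test"],
    ],
})

# explode() yields one row per matched word, so each group becomes a flat column of words.
exploded = df.explode("matches")

# Take the most frequent word per keyword.
top_word = exploded.groupby("keyword")["matches"].agg(lambda s: s.mode().iloc[0])

# One row per keyword, ready to use as the new dataframe.
df2 = top_word.rename("High_freq_word").reset_index()

# Optional: broadcast the per-keyword result back onto every original row.
df["High_freq_word"] = df["keyword"].map(top_word)
```

Here df2 has exactly one row per keyword, while the map() call is only needed if you also want the word repeated next to every original row.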

From what I understand, you can do something like this:

from itertools import chain
from scipy.stats import mode

keyword_list = ['string', 'test']
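# Find every occurrence of a keyword_list word in each row
df['matches'] = df.string.str.findall('|'.join(keyword_list))
# Forward-fill the blank keyword cells, then take the mode of each group's flattened matches
df.groupby(df.keyword.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))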

keyword
A    string
B      test
Name: matches, dtype: object
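As far as I know, recent SciPy releases no longer accept non-numeric input in scipy.stats.mode, so the snippet above may raise on string data depending on the installed version. A rough standard-library alternative using collections.Counter is sketched below; most_common_word is a hypothetical helper name, and the sample frame is assumed to match the df1 layout from the question.

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    "keyword": ["A", None, None, "B", None, None],
    "string": [
        "This is a test string for the example",
        "This is also a test string based on the other string",
        "This string is a test string based on the other strings",
        "You can probably guess that this is also a test string",
        "Yet again, another test string",
        "This is also a test",
    ],
})
keyword_list = ["string", "test"]

# Find every occurrence of a keyword_list word in each row.
df["matches"] = df["string"].str.findall("|".join(keyword_list))


def most_common_word(lists):
    """Return the most frequent word across a group's match lists, or None if there are no matches."""
    counts = Counter(chain.from_iterable(lists))
    return counts.most_common(1)[0][0] if counts else None


# Forward-fill the blank keyword cells and reduce each group to its top word.
result = df.groupby(df["keyword"].ffill())["matches"].apply(most_common_word)
print(result)
```

Counter also keeps the actual counts around, which helps if you later want the top two or three words instead of only the first.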

