
How to group by and get the most frequent words and bigrams for each group in pandas

I am currently working with a DataFrame like this:

 words:                               other:   category:    
 hello, jim, you, you , jim            val1      movie
 it, seems, bye, limb, pat, paddy      val2      movie
 how, are, you, are , kim              val1      television
 ......
 ......

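I am trying to compute the 10 most frequent words and bigrams for each category in the "category" column. However, I want to compute the bigrams before grouping them into their respective categories.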

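My problem is that if I group by category first and then take the 10 most frequent bigrams, the last word of the first row gets paired with the first word of the second row.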

The bigrams should look like this:

 (hello, jim), (jim, you), (you, you), (you, jim)
 (it, seems), (seems,bye), (bye, limb), (limb, pat), (pat, paddy)
 (how, are), (are, you), (you, are), (are, kim)

Whereas if I group before computing the bigrams, they would be:

 (hello, jim), (jim, you), (you, you), (you, jim), (jim, it), (it, seems), (seems,bye), (bye, limb), (limb, pat), (pat, paddy)
 (how, are), (are, you), (you, are), (are, kim)

What is the best way to do this with pandas?

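Sorry if my question is unnecessarily complicated; I just wanted to include all the details. Please let me know if you have any questions.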

Example DataFrame:

                                   words other    category
0             hello, jim, you, you , jim  val1       movie
1  it, seems, bye, limb, pat, hello, jim  val2       movie
2               how, are, you, are , kim  val1  television
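
To reproduce the example, the DataFrame above could be built roughly as follows. This is a minimal sketch: the values are copied from the table, and the name `df` is chosen to match the code later in the post.

```python
import pandas as pd

# Values copied from the example table. The stray spaces around some
# commas are kept on purpose, since the later code strips them.
df = pd.DataFrame({
    "words": [
        "hello, jim, you, you , jim",
        "it, seems, bye, limb, pat, hello, jim",
        "how, are, you, are , kim",
    ],
    "other": ["val1", "val2", "val1"],
    "category": ["movie", "movie", "television"],
})
print(df)
```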

Here is one way to compute the bigrams row by row using pandas and .iterrows():

bigrams = []
for idx, row in df.iterrows():
    lst = row['words'].split(',')
    bigrams.append([(lst[x].strip(), lst[x+1].strip()) for x in range(len(lst)-1)])

print(bigrams)
[[('hello', 'jim'), ('jim', 'you'), ('you', 'you'), ('you', 'jim')], 
[('it', 'seems'), ('seems', 'bye'), ('bye', 'limb'), ('limb', 'pat'), ('pat', 'hello'), ('hello', 'jim')], 
[('how', 'are'), ('are', 'you'), ('you', 'are'), ('are', 'kim')]]

Here is a more efficient approach using pandas and .apply:

def bigram(row):
    lst = row['words'].split(', ')
    return [(lst[x].strip(), lst[x+1].strip()) for x in range(len(lst)-1)]

bigrams = df.apply(bigram, axis=1)

print(bigrams.tolist())
[[('hello', 'jim'), ('jim', 'you'), ('you', 'you'), ('you', 'jim')], 
[('it', 'seems'), ('seems', 'bye'), ('bye', 'limb'), ('limb', 'pat'), ('pat', 'hello'), ('hello', 'jim')], 
[('how', 'are'), ('are', 'you'), ('you', 'are'), ('are', 'kim')]]
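
As a possible variation, the index arithmetic can be replaced by pairing the token list with itself shifted by one position. This is only a sketch: the helper name `to_bigrams` and the small `demo` frame are hypothetical and exist purely for illustration. It applies the function to the `words` column alone, so each row is still processed independently.

```python
import pandas as pd

def to_bigrams(text):
    """Split a comma-separated string and pair each token with its successor."""
    tokens = [t.strip() for t in text.split(",")]
    # zip stops at the shorter input, so this yields len(tokens) - 1 pairs
    return list(zip(tokens, tokens[1:]))

# Hypothetical two-row frame, used only to illustrate the call
demo = pd.DataFrame({"words": ["hello, jim, you, you , jim",
                               "how, are, you, are , kim"]})
demo_bigrams = demo["words"].apply(to_bigrams)
print(demo_bigrams.tolist())
```

Because the function only ever sees one row's string, no bigram can span two rows.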

You can then group the data by category and find the 10 most frequent bigrams. Here is an example of finding the most frequent bigrams by category:

from collections import Counter

df['bigrams'] = bigrams
# Summing lists concatenates them, so each category gets one combined list of bigrams
df2 = df.groupby('category').agg({'bigrams': 'sum'})

# Count how often each bigram occurs within each category
df3 = df2.bigrams.apply(Counter).to_frame()

Dictionary of bigram frequencies by category:

print(df3)

                                                      bigrams
category                                                     
movie       {('hello', 'jim'): 2, ('jim', 'you'): 1, ('you...
television  {('how', 'are'): 1, ('are', 'you'): 1, ('you',...

# Keep only the 3 most frequent bigrams per category (use 10 if you have enough data)
df3.bigrams.apply(lambda counts: [pair for pair, _ in counts.most_common(3)])
category
movie         [(hello, jim), (jim, you), (you, you)]
television      [(how, are), (are, you), (you, are)]
Name: bigrams, dtype: object
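
The question also asks for the most frequent single words per category. One way that might look is sketched below. It is an illustration only: the helper `top_words`, the frame `df_words`, and the cutoff `n=3` are assumptions, not part of the original answer. Cross-row merging is not a concern for single words, so the tokens of each group can simply be pooled.

```python
from collections import Counter

import pandas as pd

# Same example data as in the question, rebuilt here so the sketch is self-contained
df_words = pd.DataFrame({
    "words": [
        "hello, jim, you, you , jim",
        "it, seems, bye, limb, pat, hello, jim",
        "how, are, you, are , kim",
    ],
    "category": ["movie", "movie", "television"],
})

def top_words(texts, n=3):
    """Count stripped tokens over all rows of one group; return the n most common."""
    counts = Counter(tok.strip() for text in texts for tok in text.split(","))
    return [word for word, _ in counts.most_common(n)]

# One list of top words per category
top_by_category = df_words.groupby("category")["words"].apply(top_words)
print(top_by_category)
```

With the real data, `n=10` would match the top-10 requirement in the question.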
