How to group by a column in a DataFrame that also has a column containing lists of tuples
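I am trying to group a DataFrame by the values in the "category" column. However, each row of another column, "prob", contains a list of tuples. When I try to group by "category", the "prob" column disappears.
My current df: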
category other: prob:
one val [(hi, hello), (jimbob, joe)]
one val2 [(this, not), (is, work), (now, any)]
two val2 [(bob, jones), (work, here)]
three val3 [(milk, coffee), (tea, bread)]
two val3 [(money, here), (job, money)]
Expected output:
category: other: prob:
one val, val2 [(hi, hello), (jimbob, joe), (this, not), (is, work), (now, any)]
two val2, val3 [(bob, jones), (work, here), (money, here), (job, money)]
three val3 [(milk, coffee), (tea, bread)]
What is the best way to do this? Apologies if I have worded the question poorly; please let me know. Thanks!
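You can aggregate the data with GroupBy.agg, using join for the string column and a flattening function for the lists of tuples. Three flattening solutions are shown below; use sum only when the data is small and performance does not matter.
Use: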
import functools
import operator
from itertools import chain
f = lambda x: [z for y in x for z in y]
#faster alternative
#f = lambda x: list(chain.from_iterable(x))
#faster alternative2
#f = lambda x: functools.reduce(operator.iadd, x, [])
#slow alternative
#f = lambda x: x.sum()
df = df.groupby('category', as_index=False).agg({'other':', '.join, 'prob':f})
print (df)
category other prob
0 one val, val2 [(hi, hello), (jimbob, joe), (this, not), (is,...
1 three val3 [(milk, coffee), (tea, bread)]
2 two val2, val3 [(bob, jones), (work, here), (money, here), (j...
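For a self-contained check, the sketch below rebuilds the question's sample frame and applies the same aggregation. The helper name flatten is an illustrative choice rather than part of the answer, and the tuple elements are assumed to be plain strings.

```python
import pandas as pd
from itertools import chain

# Sample data from the question; tuple elements are assumed to be strings.
df = pd.DataFrame({
    "category": ["one", "one", "two", "three", "two"],
    "other": ["val", "val2", "val2", "val3", "val3"],
    "prob": [
        [("hi", "hello"), ("jimbob", "joe")],
        [("this", "not"), ("is", "work"), ("now", "any")],
        [("bob", "jones"), ("work", "here")],
        [("milk", "coffee"), ("tea", "bread")],
        [("money", "here"), ("job", "money")],
    ],
})

def flatten(lists):
    """Concatenate a group's lists of tuples into one list (hypothetical helper)."""
    return list(chain.from_iterable(lists))

# Join the strings in 'other' and flatten the lists in 'prob' per category.
result = df.groupby("category", as_index=False).agg(
    {"other": ", ".join, "prob": flatten}
)
print(result)
```

Because every column is given an explicit aggregation function in the dict, "prob" is kept in the result instead of being dropped.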
Performance:
Test code:
import string
import functools
import operator
from itertools import chain

import numpy as np
import pandas as pd
import perfplot

np.random.seed(2019)

def iadd(df1):
    f = lambda x: functools.reduce(operator.iadd, x, [])
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def listcomp(df1):
    f = lambda x: [z for y in x for z in y]
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def from_iterable(df1):
    f = lambda x: list(chain.from_iterable(x))
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_series(df1):
    f = lambda x: x.sum()
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_cat(df1):
    d = {'other':lambda x: x.str.cat(sep=', '), 'prob':'sum'}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_join(df1):
    d = {'other': ', '.join, 'prob': 'sum'}
    return df1.groupby('category', as_index=False).agg(d)

def make_df(n):
    # integer upper bound, so roughly n / 10 distinct categories
    a = np.random.randint(0, n // 10, n)
    b = np.random.choice(list('abcdef'), len(a))
    c = [tuple(np.random.choice(list(string.ascii_letters), 2)) for _ in a]
    df = pd.DataFrame({"category":a, "other":b, "prob":c})
    # collect the tuples into one list per (category, other) pair
    df1 = df.groupby(['category','other'])['prob'].apply(list).reset_index()
    return df1

perfplot.show(
    setup=make_df,
    kernels=[iadd, listcomp, from_iterable, sum_series, sum_groupby_cat, sum_groupby_join],
    n_range=[10**k for k in range(1, 8)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
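Since the benchmark sets equality_check=False, perfplot does not verify that the kernels return the same result. A minimal standard-library sketch, using a hypothetical group of two lists, can confirm that the three flattening helpers agree and that the inputs are left unchanged:

```python
import functools
import operator
from itertools import chain

# Hypothetical stand-in for one group's 'prob' values: a sequence of lists of tuples.
group = [
    [("hi", "hello"), ("jimbob", "joe")],
    [("this", "not"), ("is", "work"), ("now", "any")],
]

by_listcomp = [z for y in group for z in y]
by_chain = list(chain.from_iterable(group))
# The initial [] is the accumulator, so iadd extends that fresh list
# and leaves the input lists untouched.
by_iadd = functools.reduce(operator.iadd, group, [])

assert by_listcomp == by_chain == by_iadd
assert len(group[0]) == 2  # the first input list was not extended in place
print(by_listcomp)
```

The empty-list initializer matters for the iadd variant: without it, reduce would use the first list in the group as the accumulator and extend it in place, which would modify the original data.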
You can group by the category column and aggregate with the following functions:
df.groupby('category', as_index=False).agg({'other':lambda x: x.str.cat(sep=', '),
'prob':'sum'})
The first few rows give:
category other prob
0 one val, val2 [(hi, hello), (jimbob, joe), (this, not), (is,...
1 two val2 [(bob, jones), (work, here)]
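One practical difference between x.str.cat(sep=', ') and ', '.join shows up when the string column has missing values. The sketch below uses a hypothetical Series with a gap (the question's data has none) to illustrate it:

```python
import pandas as pd

# Hypothetical 'other' values for one group, with one missing entry.
s = pd.Series(["val", None, "val2"])

# With no `others` and no `na_rep`, str.cat omits missing values.
joined = s.str.cat(sep=", ")

# str.join only accepts strings, so the missing entry raises TypeError.
try:
    ", ".join(s)
    join_failed = False
except TypeError:
    join_failed = True

print(joined)       # val, val2
print(join_failed)  # True
```

If the column is known to contain only strings, the two approaches produce the same text, so the choice mostly comes down to how missing values should be handled.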