How to group a DataFrame by one column when another column contains lists of tuples

Question

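I am trying to group a DataFrame by the values in the "category" column. However, every row of another column, "prob", contains a list of tuples. When I try to group by "category", the "prob" column disappears.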

My current df:

 category          other:          prob:
   one              val         [(hi, hello), (jimbob, joe)]
   one              val2        [(this, not), (is, work), (now, any)]
   two              val2        [(bob, jones), (work, here)]
   three            val3        [(milk, coffee), (tea, bread)]
   two              val3        [(money, here), (job, money)]

Expected output:

 category:           other:         prob:
   one             val, val2     [(hi, hello), (jimbob, joe), (this, not), (is, work), (now, any)]
   two             val2, val3    [(bob, jones), (work, here), (money, here), (job, money)]
   three           val3          [(milk, coffee), (tea, bread)]

What is the best way to do this? Sorry if I have worded the question badly; please let me know. Thanks!

Answer 1

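You can aggregate with GroupBy.agg, using join for the string column and a flattening function for the lists of tuples. Three flattening solutions are shown below; use the sum variant only when the data is small and performance does not matter: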

import functools
import operator
from itertools import chain

# Flatten the lists of one group into a single list.
f = lambda x: [z for y in x for z in y]
# Faster alternative:
# f = lambda x: list(chain.from_iterable(x))
# Faster alternative 2:
# f = lambda x: functools.reduce(operator.iadd, x, [])
# Slow alternative:
# f = lambda x: x.sum()
df = df.groupby('category', as_index=False).agg({'other': ', '.join, 'prob': f})

print (df)
  category       other                                               prob
0      one   val, val2  [(hi, hello), (jimbob, joe), (this, not), (is,...
1    three        val3                     [(milk, coffee), (tea, bread)]
2      two  val2, val3  [(bob, jones), (work, here), (money, here), (j...
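
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the helper name flatten is only illustrative. Passing sort=False is my own addition (not part of the answer above); it keeps the groups in order of first appearance, which matches the one, two, three order of the expected output.

```python
from itertools import chain

import pandas as pd

# Sample data reconstructed by hand from the table in the question.
df = pd.DataFrame({
    "category": ["one", "one", "two", "three", "two"],
    "other": ["val", "val2", "val2", "val3", "val3"],
    "prob": [
        [("hi", "hello"), ("jimbob", "joe")],
        [("this", "not"), ("is", "work"), ("now", "any")],
        [("bob", "jones"), ("work", "here")],
        [("milk", "coffee"), ("tea", "bread")],
        [("money", "here"), ("job", "money")],
    ],
})


def flatten(lists):
    """Concatenate all lists of one group into a single list (illustrative helper)."""
    return list(chain.from_iterable(lists))


# sort=False keeps groups in order of first appearance instead of sorting the keys.
result = df.groupby("category", as_index=False, sort=False).agg(
    {"other": ", ".join, "prob": flatten}
)
print(result)
```

Without sort=False the group keys are sorted, which is why "three" comes before "two" in the printed output above.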

Performance:

Test code:

import functools
import operator
import string
from itertools import chain

import numpy as np
import pandas as pd
import perfplot

np.random.seed(2019)


default_value = 10

def iadd(df1):
    f = lambda x: functools.reduce(operator.iadd, x, [])
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def listcomp(df1):
    f = lambda x: [z for y in x for z in y]
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def from_iterable(df1):
    f = lambda x: list(chain.from_iterable(x))
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_series(df1):
    f = lambda x: x.sum()
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_cat(df1):
    d = {'other':lambda x: x.str.cat(sep=', '), 'prob':'sum'}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_join(df1):
    d = {'other': ', '.join, 'prob': 'sum'}
    return df1.groupby('category', as_index=False).agg(d)


def make_df(n):
    a = np.random.randint(0, n // 10, n)  # integer upper bound for randint
    b = np.random.choice(list('abcdef'), len(a))
    c = [tuple(np.random.choice(list(string.ascii_letters), 2)) for _ in a]
    df = pd.DataFrame({"category":a, "other":b, "prob":c})
    df1 = df.groupby(['category','other'])['prob'].apply(list).reset_index()
    return df1

perfplot.show(
    setup=make_df,
    kernels=[iadd, listcomp, from_iterable, sum_series,sum_groupby_cat,sum_groupby_join],
    n_range=[10**k for k in range(1, 8)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
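
Since the benchmark runs with equality_check=False, it may be worth confirming separately that the four flattening approaches agree. The following is only a small sanity-check sketch; the Series named group is a made-up stand-in for the 'prob' values of a single group, not data from the benchmark.

```python
import functools
import operator
from itertools import chain

import pandas as pd

# Hypothetical stand-in for the 'prob' values of one group.
group = pd.Series([[("a", "b")], [("c", "d"), ("e", "f")]])

by_listcomp = [z for y in group for z in y]
by_chain = list(chain.from_iterable(group))
# iadd extends the fresh [] accumulator in place; the lists in the Series are not modified.
by_iadd = functools.reduce(operator.iadd, group, [])
# Series.sum on an object column adds the lists pairwise, building a new list at each step.
by_sum = group.sum()

expected = [("a", "b"), ("c", "d"), ("e", "f")]
print(by_listcomp == by_chain == by_iadd == by_sum == expected)
```

The repeated copying in the sum variant is a plausible reason for it being labelled the slow alternative above.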

Answer 2

You can GroupBy the category column and aggregate with the following functions:

df.groupby('category', as_index=False).agg({'other':lambda x: x.str.cat(sep=', '),
                                            'prob':'sum'})

For the first few rows, this gives:

   category   other                             prob
0      one  val, val2  [(hi, hello), (jimbob, joe), (this, not), (is,...
1      two      val2                       [(bob, jones), (work, here)]
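
A side note on x.str.cat(sep=', ') versus ', '.join, which the other answers use for the same column: on clean string data they return the same result, but they differ when a value is missing. The sketch below uses made-up Series values to show this; it is an illustration under that assumption, not something taken from the question's data.

```python
import numpy as np
import pandas as pd

# Clean strings: both approaches give the same joined string.
other = pd.Series(["val", "val2"], dtype=object)
same = other.str.cat(sep=", ") == ", ".join(other)

# With a missing value, str.cat skips it by default (na_rep=None) ...
with_missing = pd.Series(["val", np.nan, "val2"], dtype=object)
joined = with_missing.str.cat(sep=", ")

# ... while str.join raises, because NaN is a float rather than a str.
try:
    ", ".join(with_missing)
    join_failed = False
except TypeError:
    join_failed = True

print(same, joined, join_failed)
```

So str.cat is the more forgiving choice if the 'other' column may contain NaN.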

Answer 3

Try groupby() + agg():

df.groupby('category').agg({'other': ', '.join, 'prob': 'sum'})
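
One thing to be aware of with this form: without as_index=False, the category values become the index of the result rather than a regular column. A small sketch with the first three rows of the question's data (rebuilt by hand here) shows the difference and how reset_index() brings the column back.

```python
import pandas as pd

# First three rows of the question's data, rebuilt by hand.
df = pd.DataFrame({
    "category": ["one", "one", "two"],
    "other": ["val", "val2", "val2"],
    "prob": [
        [("hi", "hello"), ("jimbob", "joe")],
        [("this", "not"), ("is", "work"), ("now", "any")],
        [("bob", "jones"), ("work", "here")],
    ],
})

# 'category' ends up as the index of the aggregated frame.
grouped = df.groupby("category").agg({"other": ", ".join, "prob": "sum"})

# reset_index() turns it back into an ordinary column.
flat = grouped.reset_index()
print(flat)
```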

Solution 1
4 2019-03-26 12:58:59

Solution 2
2 2019-03-26 13:02:29

Solution 3
0 2019-03-26 19:41:29