
python pandas groupby multiple columns to one row

I want to group a dataframe on a key, in this case clientid, and have all the columns of each group concatenated into one long row per key.

For example:

clientid, name, age, company
1,        tom,  31,  awesome
1,        jen,  26,  argos
2,        bob,  18,  hmv
3,        ted,  12,  mcdonalds
4,        sarah,30,  MnS
4,        mike, 52,  Mns
4,        luke, 75,  argos

Wanted result:

clientid, name, age, company,  name, age, company, name, age, company
1,        tom,  31,  awesome,  jen,  26,  argos,
2,        bob,  18,  hmv,
3,        ted,  12,  mcdonalds,
4,        sarah,30,  MnS,      mike, 52,  Mns,     luke, 75,  argos,

A similar question was answered with the following solution:

df_info = df1.groupby('clientid')['info'].unique().apply(pd.Series).reset_index()
info_len = len([col for col in df_info if str(col).isdigit()])
df_info.columns = ['clientid'] + ['info'] * info_len
df_info

But I can't find how to apply this to multiple columns.

This comes with a health warning: by leaving the DataFrame structure you lose much of the power of pandas, namely the ability to groupby, the great performance, and the powerful, clean syntax (so in some sense it's a feature that you can't do this easily!)... and it's just not very pandorable.

So I strongly suggest not doing this, as there is almost certainly a better way to do whatever it is you're doing...


I think you need to groupby the clientid and then extract these strings...

In [11]: df1 = df.set_index('clientid')

In [12]: df1
Out[12]:
           name  age    company
clientid
1           tom   31    awesome
1           jen   26      argos
2           bob   18        hmv
3           ted   12  mcdonalds
4         sarah   30        MnS
4          mike   52        Mns
4          luke   75      argos

In [13]: g = df1.groupby(df1.index)

I would probably look into using to_csv over each group:

In [14]: g.apply(lambda x: x.to_csv(header=False, index=False, line_terminator=','))
Out[14]:
clientid
1                      tom,31,awesome,jen,26,argos,
2                                       bob,18,hmv,
3                                 ted,12,mcdonalds,
4           sarah,30,MnS,mike,52,Mns,luke,75,argos,
dtype: object
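For readers who want to run this outside an IPython session, here is a minimal self-contained sketch of the same to_csv idea. The sample frame is rebuilt from the question, and `flatten_group` is a hypothetical helper name. It joins the CSV lines itself instead of passing a terminator keyword, because that keyword is spelled `line_terminator` in older pandas and `lineterminator` from pandas 1.5 onwards.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "clientid": [1, 1, 2, 3, 4, 4, 4],
    "name": ["tom", "jen", "bob", "ted", "sarah", "mike", "luke"],
    "age": [31, 26, 18, 12, 30, 52, 75],
    "company": ["awesome", "argos", "hmv", "mcdonalds", "MnS", "Mns", "argos"],
})
df1 = df.set_index("clientid")


def flatten_group(group):
    """Hypothetical helper: turn all rows of one group into a single string."""
    # Without a path, to_csv returns the group as CSV text, one line per row.
    lines = group.to_csv(header=False, index=False).splitlines()
    # Join the rows with commas and keep the trailing comma shown above.
    return ",".join(lines) + ","


flat = df1.groupby(level="clientid").apply(flatten_group)
print(flat)
```

The result is a Series of strings indexed by clientid, so nothing here is numeric any more; that is the loss of structure the warning above refers to.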

An alternative is to use apply with a concat of the rows:

In [15]: g.apply(lambda x: pd.concat([row for _, row in x.iterrows()]).values)
Out[15]:
clientid
1                         [tom, 31, awesome, jen, 26, argos]
2                                             [bob, 18, hmv]
3                                       [ted, 12, mcdonalds]
4           [sarah, 30, MnS, mike, 52, Mns, luke, 75, argos]
dtype: object

You have to hack this a little to get the correct header:

In [16]: list(df1.columns) * g.apply(len).max()
Out[16]: ['name', 'age', 'company', 'name', 'age', 'company', 'name', 'age', 'company']

So, you can do something like the following:

In [21]: s = g.apply(lambda x: pd.concat([row for _, row in x.iterrows()]).values).apply(lambda row: ','.join([str(x) for x in row]))

In [22]: s.name = ','.join(list(df1.columns) * g.apply(len).max())

In [23]: s.to_frame().to_csv(quotechar=" ")  # Note: this is a hack since quoting=0 seems to be ignored
Out[23]: 'clientid, name,age,company,name,age,company,name,age,company \n1, tom,31,awesome,jen,26,argos \n2, bob,18,hmv \n3, ted,12,mcdonalds \n4, sarah,30,MnS,mike,52,Mns,luke,75,argos \n'
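If the goal is a wide table rather than a string, a different technique (not the one used in this answer) may be worth considering: number the rows inside each group with cumcount and move that number into the columns with unstack. The sketch below assumes the sample data from the question; the `name_0`, `age_0`, ... column labels are an illustrative choice, not something the question asks for.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "clientid": [1, 1, 2, 3, 4, 4, 4],
    "name": ["tom", "jen", "bob", "ted", "sarah", "mike", "luke"],
    "age": [31, 26, 18, 12, 30, 52, 75],
    "company": ["awesome", "argos", "hmv", "mcdonalds", "MnS", "Mns", "argos"],
})
fields = ["name", "age", "company"]

# Position of each row within its client: 0 for the first row, 1 for the next, ...
pos = df.groupby("clientid").cumcount()

# Move that position into the columns: one (field, position) column per slot.
wide = df.set_index(["clientid", pos]).unstack()

# Reorder so the fields of each original row sit next to each other.
n_slots = int(pos.max()) + 1
wide = wide[[(f, i) for i in range(n_slots) for f in fields]]

# Flatten the two-level labels into name_0, age_0, company_0, name_1, ...
wide.columns = [f"{f}_{i}" for f, i in wide.columns]
print(wide)
```

Clients with fewer rows get NaN in the unused slots, which also turns the age columns into floats. Calling `wide.to_csv()` on this frame avoids the quotechar workaround, at the cost of numbered rather than repeated header names.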

I have tried a couple of approaches and come up with an altered version of Andy's answer that I found works well.

grouped = df1.groupby('clientid')
flattenedSeries = grouped.apply(lambda x: x.to_csv(header=False, index=False, line_terminator=','))
flattenedSeries = pd.DataFrame(flattenedSeries, columns=['data'])
ready = flattenedSeries['data'].apply(lambda x: pd.Series(x.split(',')))

Create the new column headers:

newcolumns = list(df1.columns) * grouped.apply(len).max()

Add an extra column to match the blank entry that pd.Series(x.split(',')) creates from the trailing comma:

newcolumns = newcolumns + ['extra']
ready.columns = newcolumns

Give the index a type to help with future merges:

ready.index= ready.index.astype('int64')

The line terminator can be changed if the comma is used in any of the columns' data. Note that from pandas 1.5 onwards the to_csv keyword is spelled lineterminator rather than line_terminator.
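As a rough sketch of a variation on the steps above, the trailing separator can be avoided altogether, so no 'extra' column is needed. This version builds each string with a plain join rather than to_csv and splits it with `str.split(expand=True)`; the sample data comes from the question and the variable names are only illustrative.

```python
import pandas as pd

# Sample data copied from the question, indexed by clientid as in the answers.
df1 = pd.DataFrame({
    "clientid": [1, 1, 2, 3, 4, 4, 4],
    "name": ["tom", "jen", "bob", "ted", "sarah", "mike", "luke"],
    "age": [31, 26, 18, 12, 30, 52, 75],
    "company": ["awesome", "argos", "hmv", "mcdonalds", "MnS", "Mns", "argos"],
}).set_index("clientid")

grouped = df1.groupby(level="clientid")

# Flatten each group row by row into one comma-separated string,
# without a trailing comma.
flat = grouped.apply(lambda g: ",".join(g.astype(str).to_numpy().ravel()))

# Split the strings back into columns; shorter groups are padded with
# missing values.
ready = flat.str.split(",", expand=True)

# Repeat the original headers once per row of the largest group.
ready.columns = list(df1.columns) * int(grouped.size().max())
print(ready)
```

Every value in `ready` is a string (or missing), as in the original version, and the same caveat applies: a comma inside the data would break the split, so a different separator would be needed in that case.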
