Reshaping a column into rows group-wise using Pandas
I have a DataFrame:
id name value
1 abc 10
1 qwe 23
1 zxc 12
2 sdf 10
2 wed 23
2 abc 12
2 mnb 11
I want to reshape this DataFrame into:
id n1 n2 n3 n4
1 abc qwe zxc 0
2 sdf wed abc mnb
There are 3 rows for id=1 and 4 rows for id=2, so the last column should be filled with n4=0 whenever an id has fewer rows than the largest group.
This is only a test DataFrame; an id may also have just 1 or 2 rows.
It is similar to what dcast does in R. How can this be done in pandas?
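For anyone who wants to run the answers below, the sample frame from the question can be rebuilt roughly like this. The variable name `sizes` is only for illustration; it shows that the largest group decides how many n-columns are needed.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "name": ["abc", "qwe", "zxc", "sdf", "wed", "abc", "mnb"],
    "value": [10, 23, 12, 10, 23, 12, 11],
})

# Rows per id; the largest group fixes the number of output columns.
sizes = df.groupby("id").size()
print(sizes.to_dict())
```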
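Possibly overkill:
import numpy as np
import pandas as pd

# Assumes df is sorted by id, so each group's rows are contiguous.
f, u = pd.factorize(df['id'].values)   # f: group code per row, u: unique ids
b = np.bincount(f)                     # number of rows per group
n, m = u.size, b.max()
# Position of each row within its group: row number minus the group's start offset.
c = np.arange(f.size) - (b.cumsum() - b).repeat(b)
v = np.zeros((n, m), dtype=object)     # unused slots stay 0
v[f, c] = df['name'].values
pd.DataFrame(
    v, pd.Index(u, name='id'),
    ['n{}'.format(i) for i in range(1, m + 1)]
).reset_index()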
id n1 n2 n3 n4
0 1 abc qwe zxc 0
1 2 sdf wed abc mnb
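The within-group column position is the row number minus the row number at which that group starts. A small sketch of just that step, using made-up group codes (sizes 2 and 4) and assuming the rows of each group are contiguous:

```python
import numpy as np

# Hypothetical group codes: group 0 has 2 rows, group 1 has 4 rows.
f = np.array([0, 0, 1, 1, 1, 1])
b = np.bincount(f)            # rows per group
starts = b.cumsum() - b       # row number where each group begins
c = np.arange(f.size) - starts.repeat(b)
print(c.tolist())             # [0, 1, 0, 1, 2, 3]
```

Subtracting a fixed multiple of the group code instead of `starts` would only give the right positions for particular group sizes.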
You could go the str route and use some regex replacement and splitting after the groupby.
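# Render each group's names as "['abc', 'qwe', 'zxc']", strip the list punctuation,
# then split on whitespace. Assumes names contain no spaces, commas, quotes or brackets.
df.groupby('id').name.apply(lambda x: str(list(x)))\
    .str.replace(r"[\[\],']", "", regex=True)\
    .str.split(expand=True).fillna(0)\
    .rename(columns = lambda x: 'n{}'.format(x + 1))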
n1 n2 n3 n4
id
1 abc qwe zxc 0
2 sdf wed abc mnb
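The str route depends on the names not containing any of the characters that get stripped or split on. A variant sketch that joins with an explicit separator instead; the separator `;` is an assumption and must not occur inside any name, and the cast to object is there so the integer 0 can be used as a filler:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "name": ["abc", "qwe", "zxc", "sdf", "wed", "abc", "mnb"],
})

SEP = ";"  # assumed never to appear inside a name

wide = (df.groupby("id")["name"].agg(SEP.join)   # "abc;qwe;zxc", ...
          .str.split(SEP, expand=True)           # one column per piece
          .astype(object)                        # allow mixing strings and 0
          .fillna(0)
          .rename(columns=lambda x: "n{}".format(x + 1)))
print(wide)
```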
You can use set_index with cumcount to number the rows within each group (these counts become the new column names), reshape with unstack, and finally rename the columns:
df = (df.set_index(['id', df.groupby('id').cumcount()])['name']
.unstack(fill_value=0)
.rename(columns = lambda x: 'n{}'.format(x + 1))
.reset_index())
print (df)
id n1 n2 n3 n4
0 1 abc qwe zxc 0
1 2 sdf wed abc mnb
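Since the question mentions ids with only 1 or 2 rows, here is a sketch of the same cumcount/unstack idea wrapped in a function and run on such a frame. The helper name `spread_names` and the frame `small` are made up for illustration, and the cast to object is an assumption so that the integer filler can sit next to strings.

```python
import pandas as pd

def spread_names(frame, key="id", col="name", fill=0):
    """Hypothetical helper: one row per key, values of `col` in columns n1..nK."""
    s = frame[col].astype(object)            # object dtype: strings plus filler
    pos = frame.groupby(key).cumcount()      # 0, 1, 2, ... within each key
    s.index = pd.MultiIndex.from_arrays([frame[key], pos], names=[key, "pos"])
    wide = s.unstack(fill_value=fill)        # positions become columns
    wide.columns = ["n{}".format(i + 1) for i in wide.columns]
    return wide.reset_index()

# Made-up data: ids with 1, 2 and 3 rows.
small = pd.DataFrame({"id": [1, 2, 2, 3, 3, 3], "name": list("abcdef")})
out = spread_names(small)
print(out)
```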
A solution with the DataFrame constructor; it requires that the original data contain no NaN values:
df1 = df.groupby('id')['name'].apply(list)
print (df1)
id
1 [abc, qwe, zxc]
2 [sdf, wed, abc, mnb]
Name: name, dtype: object
df = (pd.DataFrame(df1.values.tolist(), index=df1.index)
.fillna(0)
.rename(columns = lambda x: 'n{}'.format(x + 1))
.reset_index())
print (df)
id n1 n2 n3 n4
0 1 abc qwe zxc 0
1 2 sdf wed abc mnb
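The NaN requirement matters because fillna(0) cannot tell the padding added by the constructor from a name that was already missing. A small sketch with made-up data (`df_nan` is hypothetical) where both end up as 0:

```python
import numpy as np
import pandas as pd

# Made-up data: id 1 has a genuinely missing name in its second row.
df_nan = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "name": ["abc", np.nan, "sdf", "wed", "mnb"],
})

lists = df_nan.groupby("id")["name"].apply(list)
wide = pd.DataFrame(lists.values.tolist(), index=lists.index).astype(object)
filled = wide.fillna(0)

# For id 1 the real NaN (2nd slot) and the padding (3rd slot) both become 0.
print(filled.loc[1].tolist())
```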
And a solution with GroupBy.apply and the Series constructor:
df1 = (df.groupby('id')['name'].apply(lambda x: pd.Series(x.values, index=range(1,len(x)+1)))
.unstack(fill_value=0)
.add_prefix('n')
.reset_index())
print (df1)
id n1 n2 n3 n4
0 1 abc qwe zxc 0
1 2 sdf wed abc mnb
With the dfply package it is possible to do something like R's dcast.
# for Python3 only
pip install dfply
Use the spread function of dfply.
import pandas as pd
from io import StringIO
from dfply import *
csv = StringIO("""id,name,value
1,abc,10
1,qwe,23
1,zxc,12
2,sdf,10
2,wed,23
2,abc,12
2,mnb,11""")
df = pd.read_csv(csv)
df['sequence'] = df.groupby('id').cumcount()
df = df[["id", "sequence", "name"]] >> spread(X.sequence, X.name)
df = df.set_index("id").fillna(0).rename(columns = lambda x: 'n{}'.format(x + 1)).reset_index()
print(df)
# id n1 n2 n3 n4
# 0 1 abc qwe zxc 0
# 1 2 sdf wed abc mnb
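For comparison, pandas' own pivot can do the same spread without an extra package. This is only a sketch under the same assumptions as above: a helper `sequence` column from cumcount, and the name column cast to object so that 0 can be used as the filler.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "name": ["abc", "qwe", "zxc", "sdf", "wed", "abc", "mnb"],
    "value": [10, 23, 12, 10, 23, 12, 11],
})

# 1-based position of each row within its id.
df["sequence"] = df.groupby("id").cumcount() + 1

wide = (df.astype({"name": object})
          .pivot(index="id", columns="sequence", values="name")
          .fillna(0)                    # pad ids with fewer rows
          .add_prefix("n")              # 1 -> n1, 2 -> n2, ...
          .rename_axis(columns=None)    # drop the "sequence" axis label
          .reset_index())
print(wide)
```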