"Expand" pandas dataframe by values in column
Let's say I start with a dataframe that has some data and a column of quantities:
In: df=pd.DataFrame({'first-name':['Jan','Leilani'],'Qty':[2,4]})
Out: Qty first-name
2 Jan
4 Leilani
I want to create a dataframe that copies each row into new rows a number of times equal to the quantity on that row, and labels each copy with its position. Here is what the output should look like:
Qty first-name position
2 Jan 1
2 Jan 2
4 Leilani 1
4 Leilani 2
4 Leilani 3
4 Leilani 4
I can do this in plain Python like so:
l = []
for idx in df.index:
    x = 0
    for _ in range(df.loc[idx, 'Qty']):
        x += 1
        tempSrs = df.loc[idx].copy()  # copy so the original row is not modified
        tempSrs['position'] = x
        l.append(tempSrs)
outDf = pd.DataFrame(l)
This is very slow. Is there a way to do this using pandas functions? This is effectively an "unpivot", which in pandas is "melt", but I wasn't able to figure out how to use the melt function to accomplish this.
Thanks,
With repeat and cumcount:
Newdf=df.reindex(df.index.repeat(df.Qty))
Newdf['position']=Newdf.groupby(level=0).cumcount()+1
Newdf
Out[931]:
Qty first-name position
0 2 jan 1
0 2 jan 2
1 4 jay 1
1 4 jay 2
1 4 jay 3
1 4 jay 4
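For anyone who wants to run the repeat and cumcount idea end to end, here is a minimal self-contained sketch using the data from the question. The helper name expand_by_qty and its parameters are made up for illustration, and it assumes the original index is unique, since grouping on the index is what tells the copies of one row apart.

```python
import pandas as pd

def expand_by_qty(df, qty_col="Qty", pos_col="position"):
    """Hypothetical helper: repeat each row qty_col times and number the copies."""
    # Repeat each index label according to its quantity, then select those rows.
    out = df.loc[df.index.repeat(df[qty_col])].copy()
    # Copies of the same original row share an index label, so grouping on
    # the index (level=0) numbers the copies within each original row.
    out[pos_col] = out.groupby(level=0).cumcount() + 1
    # Drop the repeated index labels in favour of a fresh 0..n-1 index.
    return out.reset_index(drop=True)

df = pd.DataFrame({"first-name": ["Jan", "Leilani"], "Qty": [2, 4]})
expanded = expand_by_qty(df)
print(expanded)
```

The reset_index(drop=True) at the end is optional; leave it out if you want to keep the repeated index labels shown in the output above.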
The differences are:
- loc instead of reindex (same thing)
- assign instead of = assignment (assign produces a copy)
- a lambda passed to assign to embed the groupby logic
df.loc[df.index.repeat(df.Qty)].assign(
    position=lambda d: d.groupby('first-name').cumcount() + 1
)
Qty first-name position
0 2 jan 1
0 2 jan 2
1 4 jay 1
1 4 jay 2
1 4 jay 3
1 4 jay 4
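One caveat worth checking on your own data: grouping by 'first-name' and grouping by the index level only agree when the names are unique. The sketch below uses made-up data in which two separate rows share a name, to show how the two groupings diverge.

```python
import pandas as pd

# Hypothetical data: two separate rows that happen to share a first name.
df = pd.DataFrame({"first-name": ["jan", "jan"], "Qty": [2, 3]})
rep = df.loc[df.index.repeat(df["Qty"])]

# Grouping on the index restarts the count for each original row.
by_row = (rep.groupby(level=0).cumcount() + 1).tolist()
# Grouping on the name keeps counting across rows with the same name.
by_name = (rep.groupby("first-name").cumcount() + 1).tolist()

print(by_row)
print(by_name)
```

If the position should restart for every original row, groupby(level=0) looks like the safer choice; if it should run across all rows with the same name, grouping by the column does that.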
np.arange
q = df.Qty.values
r = np.arange(q.sum()) - np.append(0, q[:-1]).cumsum().repeat(q) + 1
df.loc[df.index.repeat(q)].assign(position=r)
Qty first-name position
0 2 jan 1
0 2 jan 2
1 4 jay 1
1 4 jay 2
1 4 jay 3
1 4 jay 4
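The one-liner that builds r is dense, so here it is broken into steps for the same quantities [2, 4]. The intermediate variable names are only for illustration.

```python
import numpy as np

q = np.array([2, 4])                    # quantities per original row

running = np.arange(q.sum())            # [0 1 2 3 4 5], one slot per output row
starts = np.append(0, q[:-1]).cumsum()  # [0 2], output row where each group begins
offsets = starts.repeat(q)              # [0 0 2 2 2 2], group start for every output row
position = running - offsets + 1        # [1 2 1 2 3 4], 1-based position within group

print(position)
```

Subtracting each group's starting offset from the running counter is what resets the count to 1 at the beginning of every group.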
Here is an intuitive way using numpy.repeat and itertools.chain.
For larger dataframes, this is likely to be more efficient than a pandorable method.
import pandas as pd
import numpy as np
from itertools import chain
df = pd.DataFrame({'first-name':['jan','jay'],'Qty':[2,4]})
lens = df['Qty'].values
res = pd.DataFrame({'Qty': np.repeat(df['Qty'], lens),
                    'first-name': np.repeat(df['first-name'], lens),
                    'Count': list(chain.from_iterable(range(1, i+1) for i in lens))})
print(res)
Count Qty first-name
0 1 2 jan
0 2 2 jan
1 1 4 jay
1 2 4 jay
1 3 4 jay
1 4 4 jay
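Whether this is actually faster depends on the data and the pandas version, so it is worth measuring. Below is a rough timing sketch on randomly generated data. The function names, the row count, and the quantity range are arbitrary choices for illustration, and no particular result is implied.

```python
import timeit
from itertools import chain

import numpy as np
import pandas as pd

# Made-up test data: 10,000 rows with quantities between 1 and 9.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({"first-name": [f"name{i}" for i in range(n)],
                   "Qty": rng.integers(1, 10, size=n)})

def with_groupby(df):
    # repeat + groupby/cumcount on the repeated index
    out = df.loc[df.index.repeat(df["Qty"])]
    return out.assign(position=out.groupby(level=0).cumcount() + 1)

def with_chain(df):
    # numpy.repeat for the columns, itertools.chain for the counter
    lens = df["Qty"].to_numpy()
    return pd.DataFrame({
        "Qty": np.repeat(lens, lens),
        "first-name": np.repeat(df["first-name"].to_numpy(), lens),
        "position": list(chain.from_iterable(range(1, i + 1) for i in lens)),
    })

a = with_groupby(df)
b = with_chain(df)
same = a["position"].tolist() == b["position"].tolist()
print("same positions:", same)

for fn in (with_groupby, with_chain):
    print(fn.__name__, timeit.timeit(lambda: fn(df), number=5))
```

Both functions should produce the same position column; only the timings are expected to differ.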