"Expand" pandas dataframe by values in column

Let's say I start with a dataframe that has some data and a column of quantities:

In:  df=pd.DataFrame({'first-name':['Jan','Leilani'],'Qty':[2,4]})

Out: Qty    first-name
     2      Jan
     4      Leilani

I want to create a dataframe that copies each row into new rows, as many times as the quantity on that row, and labels each copy with its position. Here is what the output should look like:

Qty     first-name  position
2       Jan         1
2       Jan         2
4       Leilani     1
4       Leilani     2
4       Leilani     3
4       Leilani     4

I can do this using plain Python like so:

l=[]

for idx in df.index:
    x=0  # position counter, restarted for every original row
    for _ in range(df.loc[idx, 'Qty']):
        x+=1
        tempSrs=df.loc[idx].copy()  # copy so each appended row is independent
        tempSrs['position']=x
        l.append(tempSrs)

outDf=pd.DataFrame(l)

This is very slow. Is there a way to do this using pandas functions? This is effectively an "unpivot", which in pandas is "melt", but I wasn't able to figure out how to use the melt function to accomplish this.

Thanks,

With repeat and cumcount

Newdf=df.reindex(df.index.repeat(df.Qty))
Newdf['position']=Newdf.groupby(level=0).cumcount()+1
Newdf
Out[931]: 
   Qty first-name  position
0    2        Jan         1
0    2        Jan         2
1    4    Leilani         1
1    4    Leilani         2
1    4    Leilani         3
1    4    Leilani         4
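
The two steps above can be wrapped into a small helper. This is only a minimal sketch: the function name `expand_by_qty` and its parameters are hypothetical, it assumes the dataframe has a unique index, and it resets the index at the end, which the answer above does not do.

```python
import pandas as pd

def expand_by_qty(df, qty_col="Qty", pos_col="position"):
    """Repeat each row df[qty_col] times and number the copies from 1."""
    # Repeat every index label as many times as its quantity (assumes a unique index)
    out = df.reindex(df.index.repeat(df[qty_col]))
    # Number the copies of each original row, starting from 1
    out[pos_col] = out.groupby(level=0).cumcount() + 1
    # Optional: replace the repeated labels with a fresh 0..n-1 index
    return out.reset_index(drop=True)

df = pd.DataFrame({'first-name': ['Jan', 'Leilani'], 'Qty': [2, 4]})
expanded = expand_by_qty(df)
print(expanded)
```

Rows with a quantity of 0 simply disappear from the result, because their label is repeated zero times.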

This uses almost identical concepts as Wen's answer above.

The differences are:

  1. loc instead of reindex (same result here)
  2. assign instead of = assignment (assign produces a copy)
  3. Pass a lambda to assign to embed the groupby logic

df.loc[df.index.repeat(df.Qty)].assign(
    position=lambda d: d.groupby('first-name').cumcount() + 1
)

   Qty first-name  position
0    2        Jan         1
0    2        Jan         2
1    4    Leilani         1
1    4    Leilani         2
1    4    Leilani         3
1    4    Leilani         4
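
One caveat that may be worth checking: grouping by 'first-name' numbers the copies per name, not per original row, so the result differs from grouping by index level when the same name appears on more than one row. The sketch below uses made-up data (the duplicated 'Jan' row is an assumption for illustration, not part of the question).

```python
import pandas as pd

# Hypothetical input where the same name occurs on two different rows
df = pd.DataFrame({'first-name': ['Jan', 'Jan'], 'Qty': [2, 1]})
rep = df.loc[df.index.repeat(df.Qty)]

# Counts across all rows sharing a name
by_name = rep.groupby('first-name').cumcount() + 1
# Restarts the count for each original row (index label)
by_row = rep.groupby(level=0).cumcount() + 1

print(by_name.tolist())  # [1, 2, 3]
print(by_row.tolist())   # [1, 2, 1]
```

If names are guaranteed to be unique, both groupings give the same positions.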

Construct with np.arange

q = df.Qty.values
r = np.arange(q.sum()) - np.append(0, q[:-1]).cumsum().repeat(q) + 1
df.loc[df.index.repeat(q)].assign(position=r)

   Qty first-name  position
0    2        Jan         1
0    2        Jan         2
1    4    Leilani         1
1    4    Leilani         2
1    4    Leilani         3
1    4    Leilani         4
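
The one-line expression for r is dense, so here it is split into its intermediate arrays for the quantities [2, 4] from the question. The names flat, starts and offsets are introduced only for this illustration.

```python
import numpy as np

q = np.array([2, 4])                      # quantities from the question

flat = np.arange(q.sum())                 # running counter over all output rows: [0 1 2 3 4 5]
starts = np.append(0, q[:-1]).cumsum()    # where each original row's block begins: [0 2]
offsets = starts.repeat(q)                # block start repeated for every copy: [0 0 2 2 2 2]
r = flat - offsets + 1                    # position inside each block, from 1: [1 2 1 2 3 4]

print(r)
```

Subtracting the block start from the running counter is what resets the numbering at every original row without a groupby.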

Here is an intuitive way using numpy.repeat and itertools.chain.

For larger dataframes, this is likely to be more efficient than a "pandorable" (idiomatic pandas) method.

import pandas as pd
import numpy as np
from itertools import chain

df = pd.DataFrame({'first-name':['jan','jay'],'Qty':[2,4]})

lens = df['Qty'].values

res = pd.DataFrame({'Qty': np.repeat(df['Qty'], lens),
                    'first-name': np.repeat(df['first-name'], lens),
                    'Count': list(chain.from_iterable(range(1, i+1) for i in lens))})

print(res)

   Count  Qty first-name
0      1    2        jan
0      2    2        jan
1      1    4        jay
1      2    4        jay
1      3    4        jay
1      4    4        jay
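
The efficiency claim depends on the data and the pandas version, so it is worth measuring on your own input. Below is a rough benchmarking sketch, not a definitive comparison: the random test frame, its size, and the function names with_cumcount and with_chain are all assumptions made for illustration, and no timings are quoted because they vary by machine.

```python
import timeit
from itertools import chain

import numpy as np
import pandas as pd

# Made-up test data: 10,000 rows with quantities between 1 and 4
rng = np.random.default_rng(0)
n = 10_000
big = pd.DataFrame({'first-name': [f'name{i}' for i in range(n)],
                    'Qty': rng.integers(1, 5, size=n)})

def with_cumcount(df):
    # repeat + groupby/cumcount, as in the first answer
    out = df.loc[df.index.repeat(df['Qty'])]
    return out.assign(position=out.groupby(level=0).cumcount() + 1)

def with_chain(df):
    # np.repeat per column + itertools.chain, as in this answer
    lens = df['Qty'].values
    return pd.DataFrame({
        'Qty': np.repeat(df['Qty'].values, lens),
        'first-name': np.repeat(df['first-name'].values, lens),
        'position': list(chain.from_iterable(range(1, i + 1) for i in lens)),
    })

# Both approaches should produce the same positions
same = with_cumcount(big)['position'].tolist() == with_chain(big)['position'].tolist()
print('same result:', same)

for fn in (with_cumcount, with_chain):
    print(fn.__name__, timeit.timeit(lambda: fn(big), number=5))
```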
