
Pandas: Add rows in the groups of a dataframe

I have a data frame as follows:

import pandas as pd

df = pd.DataFrame({"date": [1,2,5,6,2,3,4,5,1,3,4,5,6,1,2,3,4,5,6],
               "variable": ["A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","D","D","D","D"]})
   date variable
0   1   A
1   2   A
2   5   A
3   6   A
4   2   B
5   3   B
6   4   B
7   5   B
8   1   C
9   3   C
10  4   C
11  5   C
12  6   C
13  1   D
14  2   D
15  3   D
16  4   D
17  5   D
18  6   D

In this data frame, there are 4 values in the variable column: A, B, C, D. My goal is for each variable to contain every date from 1 to 6 in the date column.

But currently, a few values in the date column are missing for some variables. I tried grouping them and filling each value with a counter, but sometimes more than one date is missing (for example, in variable A, the dates 3 and 4 are missing). Also, the counter made my code terribly slow, as I have a couple of thousand rows.

Is there a faster and smarter way to do this without using a counter?

The desired output should be as follows:

date    variable
0   1   A
1   2   A
2   3   A
3   4   A
4   5   A
5   6   A
6   1   B
7   2   B
8   3   B
9   4   B
10  5   B
11  6   B
12  1   C
13  2   C
14  3   C
15  4   C
16  5   C
17  6   C
18  1   D
19  2   D
20  3   D
21  4   D
22  5   D
23  6   D

itertools.product

from itertools import product

pd.DataFrame([*product(
    range(df.date.min(), df.date.max() + 1),
    sorted({*df.variable})
)], columns=df.columns)

    date variable
0      1        A
1      1        B
2      1        C
3      1        D
4      2        A
5      2        B
6      2        C
7      2        D
8      3        A
9      3        B
10     3        C
11     3        D
12     4        A
13     4        B
14     4        C
15     4        D
16     5        A
17     5        B
18     5        C
19     5        D
20     6        A
21     6        B
22     6        C
23     6        D
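
The result above is ordered by date first, whereas the desired output in the question lists all dates for A, then B, and so on. One possible variant, sketched here under the assumption that df is the frame from the question, swaps the arguments to product so that variable varies slowest. The name full is only illustrative.

```python
from itertools import product

import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "date": [1, 2, 5, 6, 2, 3, 4, 5, 1, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    "variable": ["A"] * 4 + ["B"] * 4 + ["C"] * 5 + ["D"] * 6,
})

# Put variable first so it varies slowest, then restore the column order.
pairs = product(
    sorted(set(df["variable"])),
    range(df["date"].min(), df["date"].max() + 1),
)
full = pd.DataFrame(list(pairs), columns=["variable", "date"])[["date", "variable"]]
print(full)
```

Note that this, like the original snippet, takes the date range from the overall minimum and maximum in the data rather than hard-coding 1 to 6.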

Using groupby + reindex

# Wrapped in parentheses so the method chain can span several lines
(
    df.groupby('variable', as_index=False)
      .apply(lambda g: g.set_index('date').reindex([1,2,3,4,5,6]).ffill().bfill())
      .reset_index(level=1)
)

Output:

 date   variable
0   1   A
0   2   A
0   3   A
0   4   A
0   5   A
0   6   A
1   1   B
1   2   B
1   3   B
1   4   B
1   5   B
1   6   B
2   1   C
2   2   C
2   3   C
2   4   C
2   5   C
2   6   C
3   1   D
3   2   D
3   3   D
3   4   D
3   5   D
3   6   D
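
If the real data carries more columns than the two shown, the same reindex idea can be applied once to the whole frame instead of per group. The sketch below assumes a hypothetical extra column named value (not part of the question) and reindexes against the full variable/date grid, leaving NaN where a row was missing.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [1, 2, 5, 6, 2, 3, 4, 5, 1, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    "variable": ["A"] * 4 + ["B"] * 4 + ["C"] * 5 + ["D"] * 6,
})
df["value"] = range(len(df))  # hypothetical payload column, for illustration only

# Every (variable, date) combination that should exist.
full_index = pd.MultiIndex.from_product(
    [sorted(df["variable"].unique()), range(1, 7)],
    names=["variable", "date"],
)

# Reindex against the full grid; newly created rows get NaN in "value".
out = (
    df.set_index(["variable", "date"])
      .reindex(full_index)
      .reset_index()
)
print(out)
```

This avoids groupby.apply and yields a clean 0..23 index. It relies on each (variable, date) pair being unique in the input, since reindex does not accept duplicate index entries.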

This is more of a workaround, but it should work:

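# agg needs a callable; build the dates 1 to 6 for every group, then explode them into rows
df.groupby('variable')['date'].agg(lambda s: list(range(1, 7))).explode().reset_index()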

You can do something like this:

# Pad each group with placeholder rows until it has 6 rows,
# then overwrite that group's dates with 1 to 6.
for v in df['variable'].unique().tolist():
    n_rows = df.loc[df['variable'] == v].shape[0]
    while n_rows < 6:
        df.loc[df.shape[0]] = [n_rows, v]  # placeholder date, replaced below
        n_rows += 1
        df = df.sort_values(['variable', 'date']).reset_index(drop=True)
    df.loc[df['variable'] == v, 'date'] = list(range(1, 7))
df

Output:

    date variable
0   1     A
1   2     A
2   3     A
3   4     A
4   5     A
5   6     A
6   1     B
7   2     B
8   3     B
9   4     B
10  5     B
11  6     B
12  1     C
13  2     C
14  3     C  
15  4     C
16  5     C
17  6     C
18  1     D
19  2     D
20  3     D
21  4     D
22  5     D
23  6     D
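
To check the result of any of the approaches above, a small helper can confirm that every variable carries each date from 1 to 6 exactly once. This is only a sketch; the function name has_full_dates and its dates parameter are made up for illustration.

```python
import pandas as pd


def has_full_dates(frame, dates=range(1, 7)):
    """Return True if every variable has exactly the expected dates."""
    expected = list(dates)
    # Iterating a grouped Series yields (group name, Series of dates) pairs.
    return all(
        sorted(group.tolist()) == expected
        for _, group in frame.groupby("variable")["date"]
    )


# The original frame from the question has gaps.
original = pd.DataFrame({
    "date": [1, 2, 5, 6, 2, 3, 4, 5, 1, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
    "variable": ["A"] * 4 + ["B"] * 4 + ["C"] * 5 + ["D"] * 6,
})

# A completed frame with all 24 combinations.
completed = pd.DataFrame({
    "date": list(range(1, 7)) * 4,
    "variable": [v for v in "ABCD" for _ in range(6)],
})

print(has_full_dates(original))
print(has_full_dates(completed))
```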
