Pandas pad dataframe groups

I have a dataframe, e.g.:

  my_label   value
0        A   1
1        A   85
2        B   65
3        B   41
4        B   21
5        C   3

I want to group by my_label and pad each group to a multiple of a given length, filling with the group's last value. For example, for a multiple of 4, it would give:

  my_label   value
0        A   1
1        A   85
2        A   85
3        A   85
4        B   65
5        B   41
6        B   21
7        B   21
8        C   3
9        C   3
10       C   3
11       C   3

I wrote a solution that I expected to work, but for some reason the reindex does not add the new rows at the end of each group.

def _pad(group, seq_len):
    pad_number = seq_len - (len(group) % seq_len)
    if pad_number != seq_len:
        group = group.reindex(range(len(group)+pad_number)).ffill()
    return group
df = (df.groupby('my_label')
        .apply(_pad, (4))
        .reset_index(drop = True))

Here is the code to build the above DataFrame for testing:

import pandas as pd
df = pd.DataFrame({"my_label":["A","A","B","B","B","C"], "value":[1,85,65,41,21,3]})

You can concatenate, per group, a dummy DataFrame with the number of missing rows, then ffill:

N = 4
out = (df
 .groupby('my_label', group_keys=False)
 .apply(lambda d: pd.concat([d, pd.DataFrame(columns=d.columns,
                                             index=range(N-len(d)))]))
 .ffill()
 .reset_index(drop=True)
)

Or directly concatenate the last row as many times as needed:

(df
 .groupby('my_label', group_keys=False)
 .apply(lambda d: pd.concat([d, d.loc[[d.index[-1]]*(N-len(d))]]))
 .reset_index(drop=True)
)

Output:

   my_label  value
0         A      1
1         A     85
2         A     85
3         A     85
4         B     65
5         B     41
6         B     21
7         B     21
8         C      3
9         C      3
10        C      3
11        C      3
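
Both snippets above compute the padding as N - len(d), which assumes no group is longer than N. Since the question asks for a multiple of N, a minimal sketch of a generalization is shown below. The helper name pad_groups is hypothetical, and it uses a plain loop over the groups instead of groupby.apply, which is a substitution on my part rather than the approach of the answer above.

```python
import pandas as pd


def pad_groups(df, key, multiple):
    """Pad every group to the next multiple of `multiple` by repeating its last row.

    Hypothetical helper for illustration, not part of pandas.
    """
    pieces = []
    for _, d in df.groupby(key, sort=False):
        # (-len(d)) % multiple is 0 when the group length is already a multiple
        n_missing = (-len(d)) % multiple
        pieces.append(d)
        if n_missing:
            # Repeat the last row (selected by position) n_missing times
            pieces.append(d.iloc[[-1] * n_missing])
    return pd.concat(pieces, ignore_index=True)


df = pd.DataFrame({"my_label": ["A", "A", "B", "B", "B", "C"],
                   "value": [1, 85, 65, 41, 21, 3]})
out = pad_groups(df, "my_label", 4)
print(out)
```

Because the padding rows are copies of existing rows, the value column should keep its integer dtype, and a group of 5 rows would be padded to 8 rather than raising an error.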

You can solve this by creating an index that represents your desired output, aligning your existing data to it, and then forward filling.

index = pd.MultiIndex.from_product([df['my_label'].unique(), range(4)], names=['my_label', None])

out = (
    df.set_index(
        ['my_label', df.groupby('my_label').cumcount()]
    )
    .reindex(index, method='ffill')
)

print(out)
            value
my_label         
A        0    1.0
         1   85.0
         2   85.0
         3   85.0
B        0   65.0
         1   41.0
         2   21.0
         3   21.0
C        0    3.0
         1    3.0
         2    3.0
         3    3.0
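
The result above is indexed by a MultiIndex and the values are shown as floats. If the flat layout from the question is wanted, the following sketch, under the same assumptions as the answer (fixed length of 4, no group longer than that), casts the column back to its original dtype and drops the helper index. Note that reindexing onto range(N) would discard rows beyond position N - 1, so this variant does not cover groups longer than N.

```python
import pandas as pd

df = pd.DataFrame({"my_label": ["A", "A", "B", "B", "B", "C"],
                   "value": [1, 85, 65, 41, 21, 3]})
N = 4

# Target index: every label combined with positions 0..N-1
index = pd.MultiIndex.from_product(
    [df["my_label"].unique(), range(N)], names=["my_label", None]
)

flat = (
    df.set_index(["my_label", df.groupby("my_label").cumcount()])
    .reindex(index, method="ffill")
    .astype({"value": df["value"].dtype})  # undo any upcast to float
    .reset_index(level="my_label")         # label back to a regular column
    .reset_index(drop=True)                # discard the per-group position
)
print(flat)
```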

def function1(dd: pd.DataFrame):
    # Repeat the group's highest index label (its last row here) until the group has 4 rows
    return dd.loc[dd.index.tolist() + [dd.index.max()] * (4 - len(dd))]

df.groupby('my_label').apply(function1).reset_index(drop=True)

Output:

   my_label  value
0         A      1
1         A     85
2         A     85
3         A     85
4         B     65
5         B     41
6         B     21
7         B     21
8         C      3
9         C      3
10        C      3
11        C      3
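
As for why the _pad function from the question does not pad at the end of each group: reindex selects rows by index label, not by position. Each group keeps the labels of the original frame (group B has labels 2, 3, 4), so reindex(range(4)) asks for labels 0 to 3, which only lines up for group A. A minimal sketch of a fix, assuming the original index does not need to be preserved, is to reset the index inside each group first. The loop with pd.concat is my substitution for groupby.apply.

```python
import pandas as pd


def _pad(group, seq_len):
    # Rows needed to reach the next multiple of seq_len (0 if already a multiple)
    pad_number = (-len(group)) % seq_len
    # Give the group labels 0..len-1 so the new labels are appended at the end
    group = group.reset_index(drop=True)
    return group.reindex(range(len(group) + pad_number)).ffill()


df = pd.DataFrame({"my_label": ["A", "A", "B", "B", "B", "C"],
                   "value": [1, 85, 65, 41, 21, 3]})
padded = pd.concat(
    [_pad(g, 4) for _, g in df.groupby("my_label", sort=False)],
    ignore_index=True,
)
print(padded)
```

The reindex step introduces NaN before the forward fill, so the value column will likely come out as float here. Cast it back with astype if integers are required.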
