Pandas pad dataframe groups
I have a dataframe, e.g.:
my_label value
0 A 1
1 A 85
2 B 65
3 B 41
4 B 21
5 C 3
I want to group by my_label and pad each group to a multiple of a given length, filling with the group's last value. For example, for a multiple of 4, it would give:
my_label value
0 A 1
1 A 85
2 A 85
3 A 85
4 B 65
5 B 41
6 B 21
7 B 21
8 C 3
9 C 3
10 C 3
11 C 3
I managed to get a solution that should be working, but for some reason the reindex isn't done at the end of the groups.
def _pad(group, seq_len):
    pad_number = seq_len - (len(group) % seq_len)
    if pad_number != seq_len:
        group = group.reindex(range(len(group)+pad_number)).ffill()
    return group
df = (df.groupby('my_label')
        .apply(_pad, (4))
        .reset_index(drop = True))
Here is the code to the above DF for testing:
import pandas as pd
df = pd.DataFrame({"my_label":["A","A","B","B","B","C"], "value":[1,85,65,41,21,3]})
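A note on why the attempt above pads in the wrong place: inside `groupby`, each group keeps its original row labels (group B has labels 2, 3, 4), so `reindex(range(...))` asks for labels 0, 1, 2, 3 and the new empty rows land at the start of the group instead of the end. A minimal sketch of one possible fix is to reset the index inside each group before reindexing. The helper name `pad_group` and the explicit loop over groups are illustrative choices, not part of the original code.

```python
import pandas as pd

def pad_group(group, seq_len):
    # number of rows missing to reach the next multiple of seq_len (0 if already aligned)
    pad_number = -len(group) % seq_len
    # renumber rows 0..len-1 so the new labels are appended after the last row
    group = group.reset_index(drop=True)
    return group.reindex(range(len(group) + pad_number)).ffill()

df = pd.DataFrame({"my_label": ["A", "A", "B", "B", "B", "C"],
                   "value": [1, 85, 65, 41, 21, 3]})

out = pd.concat([pad_group(g, 4) for _, g in df.groupby("my_label")],
                ignore_index=True)
# reindex introduces NaN before ffill, which turns integers into floats;
# cast back to the original column dtypes
out = out.astype(df.dtypes.to_dict())
print(out)
```

Iterating over the groups and concatenating keeps the grouping column in each piece without relying on how `apply` treats it.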
You can concatenate, per group, a dummy DataFrame with the number of missing rows, then ffill:
N = 4
# note: N-len(d) assumes every group has at most N rows
out = (df
   .groupby('my_label', group_keys=False)
   .apply(lambda d: pd.concat([d, pd.DataFrame(columns=d.columns,
                                               index=range(N-len(d)))]))
   .ffill()
   .reset_index(drop=True)
)
or, directly concatenating the last row as many times as needed:
(df
   .groupby('my_label', group_keys=False)
   .apply(lambda d: pd.concat([d, d.loc[[d.index[-1]]*(N-len(d))]]))
   .reset_index(drop=True)
)
output:
my_label value
0 A 1
1 A 85
2 A 85
3 A 85
4 B 65
5 B 41
6 B 21
7 B 21
8 C 3
9 C 3
10 C 3
11 C 3
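Both snippets above compute `N-len(d)`, which works when no group is longer than N. The question asks for a multiple of N, so here is a hedged sketch that handles longer groups by using `-len(d) % N` as the number of missing rows. The function name `pad_to_multiple` and the second test frame are made up for illustration.

```python
import pandas as pd

def pad_to_multiple(df, key, n):
    """Pad each group of `df[key]` to a multiple of `n` rows by repeating its last row."""
    parts = []
    for _, d in df.groupby(key, sort=False):
        missing = -len(d) % n  # 0 when len(d) is already a multiple of n
        # positions of all rows, followed by the last position repeated `missing` times
        positions = list(range(len(d))) + [len(d) - 1] * missing
        parts.append(d.iloc[positions])
    return pd.concat(parts, ignore_index=True)

df = pd.DataFrame({"my_label": ["A", "A", "B", "B", "B", "C"],
                   "value": [1, 85, 65, 41, 21, 3]})
out = pad_to_multiple(df, "my_label", 4)

# hypothetical frame with a group longer than n: A has 5 rows and is padded to 8
df_long = pd.DataFrame({"my_label": ["A"] * 5 + ["B"],
                        "value": [1, 2, 3, 4, 5, 9]})
out_long = pad_to_multiple(df_long, "my_label", 4)
```

Because rows are selected by position rather than filled from NaN, the column dtypes are preserved and no forward fill is needed.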
You can simply solve this by creating an index that represents your desired output, aligning that to your existing data, and then forward filling.
index = pd.MultiIndex.from_product([df['my_label'].unique(), range(4)], names=['my_label', None])
out = (
    df.set_index(
        ['my_label', df.groupby('my_label').cumcount()]
    )
    .reindex(index, method='ffill')
)
print(out)
            value
my_label
A        0    1.0
         1   85.0
         2   85.0
         3   85.0
B        0   65.0
         1   41.0
         2   21.0
         3   21.0
C        0    3.0
         1    3.0
         2    3.0
         3    3.0
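The result above is indexed by (my_label, position), while the question shows a flat frame with a plain integer index. A small sketch of how the same reindex could be flattened afterwards, assuming the labels appear in sorted order in the data (forward-fill reindexing needs a monotonic index); the names `n`, `padded` and `flat` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"my_label": ["A", "A", "B", "B", "B", "C"],
                   "value": [1, 85, 65, 41, 21, 3]})
n = 4

# target index: every label paired with positions 0..n-1
index = pd.MultiIndex.from_product(
    [df["my_label"].unique(), range(n)], names=["my_label", None]
)

padded = (
    df.set_index(["my_label", df.groupby("my_label").cumcount()])
      .reindex(index, method="ffill")
)

# move my_label back to a column, then drop the per-group position counter
flat = padded.reset_index(level="my_label").reset_index(drop=True)
print(flat)
```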
def function1(dd: pd.DataFrame):
    # keep all rows, then repeat the row with the largest index label (4 - len(dd)) times
    return dd.loc[dd.index.tolist() + [dd.index.max()] * (4 - len(dd))]
df.groupby('my_label').apply(function1).reset_index(drop=True)
out:
my_label value
0 A 1
1 A 85
2 A 85
3 A 85
4 B 65
5 B 41
6 B 21
7 B 21
8 C 3
9 C 3
10 C 3
11 C 3