Expanding ranking of column in Pandas
Consider a sample DataFrame:
import numpy as np
import pandas as pd

# a, b and c are random, so the numbers printed below are one particular draw
df = pd.DataFrame(np.random.randint(0, 60, size=(10, 3)), columns=["a", "b", "c"])
df["d1"] = ["Apple", "Mango", "Apple", "Pine", "Apple", "Mango", "Apple", "Mango", "Pine", "Apple"]
df["d2"] = ["Orange", "lemon", "lemon", "Orange", "lemon", "Orange", "lemon", "Orange", "lemon", "Orange"]
df["date"] = ["2002-01-01", "2002-01-01", "2002-01-01", "2002-01-01", "2002-02-01", "2002-02-01", "2002-02-01", "2002-01-01", "2002-02-01", "2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df
   a   b   c     d1      d2        date
0  7   1  19  Apple  Orange  2002-01-01
1  3   7  17  Mango   lemon  2002-01-01
2  9   6   4  Apple   lemon  2002-01-01
3  0   5  51   Pine  Orange  2002-01-01
4  4   6   8  Apple   lemon  2002-02-01
5  4   3   1  Mango  Orange  2002-02-01
6  2   2  14  Apple   lemon  2002-02-01
7  5  15  10  Mango  Orange  2002-01-01
8  1   2  10   Pine   lemon  2002-02-01
9  2   1  12  Apple  Orange  2002-02-01
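Because a, b and c come from np.random.randint, re-running the setup will not reproduce the numbers shown. A minimal sketch that rebuilds exactly the frame printed above, with every value copied from that table rather than generated:

```python
import pandas as pd

# All values are copied from the printed table; nothing here is random.
df = pd.DataFrame({
    "a": [7, 3, 9, 0, 4, 4, 2, 5, 1, 2],
    "b": [1, 7, 6, 5, 6, 3, 2, 15, 2, 1],
    "c": [19, 17, 4, 51, 8, 1, 14, 10, 10, 12],
    "d1": ["Apple", "Mango", "Apple", "Pine", "Apple",
           "Mango", "Apple", "Mango", "Pine", "Apple"],
    "d2": ["Orange", "lemon", "lemon", "Orange", "lemon",
           "Orange", "lemon", "Orange", "lemon", "Orange"],
    "date": pd.to_datetime([
        "2002-01-01", "2002-01-01", "2002-01-01", "2002-01-01", "2002-02-01",
        "2002-02-01", "2002-02-01", "2002-01-01", "2002-02-01", "2002-02-01",
    ]),
})
print(df)
```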
I am trying to replace column d1 with a rank that is based on grouping by d1 and taking the mean of column c, computed in an expanding manner: the rank for each row may only use that row and the rows before it. For example, consider the first 5 rows:
First row, index 0: by default the value Apple is replaced with 0, because it is the only group seen so far.
Second row, index 1: the value Mango should be replaced by 0. Considering only the first 2 rows of the DataFrame, the grouped mean of c is 19 for Apple and 17 for Mango, so Mango gets rank 0 because it has the lower grouped mean.
Third row, index 2: the value Apple should be replaced by 0. Considering only the first 3 rows, the grouped mean is (19+4)/2 = 11.5 for Apple and 17 for Mango, so Apple gets rank 0 because it has the lower grouped mean.
Fourth row, index 3: the value Pine should be replaced by 2. Considering only the first 4 rows, the grouped mean is (19+4)/2 = 11.5 for Apple, 17 for Mango and 51 for Pine. Pine has the highest grouped mean of the 3 categories [Apple, Mango, Pine], so it gets rank 2.
Fifth row, index 4: the value Apple should be replaced by 0. Considering only the first 5 rows, the grouped mean is (19+4+8)/3 ≈ 10.33 for Apple, 17 for Mango and 51 for Pine. Apple has the lowest grouped mean of the 3 categories, so it gets rank 0.
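The fifth step can be checked in isolation. The sketch below hard-codes the first 5 values of c and d1 from the sample table (the variable names are only illustrative), computes the grouped means, and turns them into zero-based dense ranks:

```python
import pandas as pd

# First 5 rows of columns c and d1, copied from the sample table.
c = pd.Series([19, 17, 4, 51, 8], name="c")
d1 = pd.Series(["Apple", "Mango", "Apple", "Pine", "Apple"], name="d1")

# Mean of c per category, using only these 5 rows.
group_means = c.groupby(d1).mean()

# Dense rank of the means, shifted so that the lowest mean gets rank 0.
group_ranks = group_means.rank(method="dense").sub(1).astype(int)

# The value that replaces d1 at index 4 is the rank of that row's category.
rank_for_row_4 = group_ranks[d1.iloc[-1]]

print(group_means)
print(group_ranks)
print(rank_for_row_4)
```

Repeating this for every prefix of the frame gives the expected column; the cost of doing so row by row is what the question is about.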
Expected values of column d1:
0
0
0
2
0
0
1
0
2
1
Iterative approach:
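def expanding(data, cols):
    copy_df = data.copy(deep=True)
    for i in range(len(copy_df)):
        if i == 0:
            # Only one group has been seen, so its rank is 0
            copy_df.loc[i, cols] = 0
        else:
            # Rank the groups using rows 0..i only
            op = group_processor(data[:i + 1], cols, i)
            copy_df.loc[i, cols] = op
    return copy_df

def group_processor(cut_df, cols, i):
    op = []
    for each_col in cols:
        # Mean of c per group (pivot_table aggregates with mean by default), dense-ranked from 0
        temp = cut_df.pivot_table("c", [each_col]).rank(method="dense") - 1
        value = cut_df.loc[i, each_col]
        temp = temp.reset_index()
        # Look up the rank of the group that row i belongs to
        final_value = temp.loc[temp[each_col] == value, "c"]
        op.append(final_value.values[0])
    return op

expanding(df, ["d1"])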
I am able to do this iteratively, going through every row of the DataFrame, but the performance is poor for large DataFrames, so any suggestions for a more pandas-based approach would be great.
Use Series.expanding with a minimum window size of 1 on column c, and apply a custom lambda function exp. Inside the lambda, Series.groupby groups the expanding window w by column d1 of the original dataframe, and transform with mean replaces every value in the window with the mean of its group. Series.rank with method='dense' then ranks those means, and .iat[-1] picks the rank of the last row in the window. Subtracting 1 at the end makes the ranks start at 0:
exp = lambda w: w.groupby(df['d1']).transform('mean').rank(method='dense').iat[-1]
df['d1_new'] = df['c'].expanding(1).apply(exp).sub(1).astype(int)
Result:
# print(df)
   a   b   c     d1      d2        date  d1_new
0  7   1  19  Apple  Orange  2002-01-01       0
1  3   7  17  Mango   lemon  2002-01-01       0
2  9   6   4  Apple   lemon  2002-01-01       0
3  0   5  51   Pine  Orange  2002-01-01       2
4  4   6   8  Apple   lemon  2002-02-01       0
5  4   3   1  Mango  Orange  2002-02-01       0
6  2   2  14  Apple   lemon  2002-02-01       1
7  5  15  10  Mango  Orange  2002-01-01       0
8  1   2  10   Pine   lemon  2002-02-01       2
9  2   1  12  Apple  Orange  2002-02-01       1
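To see what the lambda does on a single window, the sketch below replays it by hand for row 6, the first row whose rank is not 0 or 2. The values of c and d1 are copied from the sample table; the intermediate variable names are only illustrative.

```python
import pandas as pd

# Columns c and d1 copied from the sample table.
c = pd.Series([19, 17, 4, 51, 8, 1, 14, 10, 10, 12])
d1 = pd.Series(["Apple", "Mango", "Apple", "Pine", "Apple",
                "Mango", "Apple", "Mango", "Pine", "Apple"])

# The window that expanding() hands to the lambda for row 6: rows 0..6.
w = c.iloc[:7]

# groupby aligns the full d1 column to the window's index, so only
# rows 0..6 take part; transform puts each group's mean on its rows.
per_row_mean = w.groupby(d1).transform("mean")

# Dense rank over those means; rows of the same group share a rank.
per_row_rank = per_row_mean.rank(method="dense")

# The lambda returns the rank of the window's last row (1-based);
# the answer subtracts 1 afterwards to make it 0-based.
rank_row_6 = int(per_row_rank.iat[-1] - 1)

print(per_row_mean)
print(rank_row_6)
```

Within this window Mango has the lowest mean (9), Apple follows (11.25) and Pine is highest (51), which is why the Apple in row 6 ends up with rank 1.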
Performance:
df.shape
(1000, 7)
%%timeit
exp = lambda w: w.groupby(df['d1']).transform('mean').rank(method='dense').iat[-1]
df['d1_new'] = df['c'].expanding(1).apply(exp).sub(1).astype(int)
3.15 s ± 305 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
expanding(df,["d1"]) # your method
11.9 s ± 449 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
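Both timed versions still run Python code once per row. As a possible alternative that is not part of the answer above and was not benchmarked here, the running mean of each group can be built from cumulative sums over one-hot indicator columns and then ranked across columns. This is a sketch under assumptions: the function name expanding_group_rank and its parameters are hypothetical, and it only relies on pd.get_dummies, DataFrame.cumsum and DataFrame.rank.

```python
import numpy as np
import pandas as pd

def expanding_group_rank(frame, group_col, value_col):
    """Zero-based dense rank of each row's group by its running mean so far.

    Hypothetical helper, written for illustration only.
    """
    # One indicator column per category (1.0 where the row belongs to it).
    member = pd.get_dummies(frame[group_col]).astype(float)
    # Running sum and running count of value_col per category, row by row.
    running_sum = member.mul(frame[value_col], axis=0).cumsum()
    running_cnt = member.cumsum()
    # Running mean; categories not seen yet become NaN and stay unranked.
    running_mean = running_sum / running_cnt.where(running_cnt > 0)
    # Rank the categories against each other within every row.
    ranks = running_mean.rank(axis=1, method="dense") - 1
    # For every row, pick the rank that belongs to that row's own category.
    col_pos = member.columns.get_indexer(frame[group_col])
    picked = ranks.to_numpy()[np.arange(len(frame)), col_pos]
    return pd.Series(picked.astype(int), index=frame.index)

# Columns c and d1 copied from the sample table.
demo = pd.DataFrame({
    "c": [19, 17, 4, 51, 8, 1, 14, 10, 10, 12],
    "d1": ["Apple", "Mango", "Apple", "Pine", "Apple",
           "Mango", "Apple", "Mango", "Pine", "Apple"],
})
demo["d1_new"] = expanding_group_rank(demo, "d1", "c")
print(demo)
```

On the sample data this reproduces the expected d1 column from the question. It builds one intermediate column per category, so memory grows with rows times categories, and because the means come from cumulative sums rather than groupby, groups whose means are tied could in principle be separated by floating-point rounding.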