
Expanding ranking of column in Pandas

Consider a sample DF:

import numpy as np
import pandas as pd

# a, b and c are random, so the values printed below are just one possible draw
df = pd.DataFrame(np.random.randint(0,60,size=(10,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Pine","Apple","Mango","Apple","Mango","Pine","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon","Orange","lemon","Orange"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01","2002-01-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df

    a   b   c     d1       d2    date
0   7   1   19  Apple   Orange  2002-01-01
1   3   7   17  Mango   lemon   2002-01-01
2   9   6   4   Apple   lemon   2002-01-01
3   0   5   51  Pine    Orange  2002-01-01
4   4   6   8   Apple   lemon   2002-02-01
5   4   3   1   Mango   Orange  2002-02-01
6   2   2   14  Apple   lemon   2002-02-01
7   5   15  10  Mango   Orange  2002-01-01
8   1   2   10  Pine    lemon   2002-02-01
9   2   1   12  Apple   Orange  2002-02-01

I am trying to replace column d1 with a rank based on grouping by column d1 and taking the mean of column c, computed in an expanding manner. For example, consider the first 5 rows:

  1. First row, index 0: by default the value, i.e. Apple, will be replaced with 0.

  2. Second row, index 1: the value Mango should be replaced by 0. Considering only the first 2 rows of the DF, the grouped mean for Apple is 19 and for Mango is 17, so Mango at index 1 gets rank 0 since it has the lower grouped mean.

  3. Third row, index 2: the value Apple should be replaced by 0. Considering only the first 3 rows of the DF, the grouped mean for Apple is (19+4)/2 and for Mango is 17, so Apple at index 2 gets rank 0 since it has the lower grouped mean.

  4. Fourth row, index 3: the value Pine should be replaced by 2. Considering only the first 4 rows of the DF, the grouped mean for Apple is (19+4)/2, for Mango 17 and for Pine 51. Since Pine has the highest grouped mean of the 3 categories [Apple, Mango, Pine], it gets rank 2.

  5. Fifth row, index 4: the value Apple should be replaced by 0. Considering only the first 5 rows of the DF, the grouped mean for Apple is (19+4+8)/3, for Mango 17 and for Pine 51. Since Apple has the lowest grouped mean of the 3 categories [Apple, Mango, Pine], it gets rank 0.
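The five steps above can be reproduced with a short sketch. It assumes the d1 and c values shown in the table (hard-coded here, since the sample frame is random); the names sample and step_ranks are only illustrative.

```python
import pandas as pd

# d1 and c for the first five rows, copied from the table above.
sample = pd.DataFrame({
    "d1": ["Apple", "Mango", "Apple", "Pine", "Apple"],
    "c": [19, 17, 4, 51, 8],
})

step_ranks = []
for i in range(len(sample)):
    # Only rows 0..i are visible at step i.
    prefix = sample.iloc[: i + 1]
    # Mean of c per d1 group within the visible rows.
    means = prefix.groupby("d1")["c"].mean()
    # Dense rank of the group means, shifted so the lowest mean gets 0.
    ranks = means.rank(method="dense").astype(int) - 1
    # Keep the rank of the group that row i belongs to.
    step_ranks.append(int(ranks[sample.loc[i, "d1"]]))

print(step_ranks)  # [0, 0, 0, 2, 0]
```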

Expected values of column d1:

0
0
0
2
0
0
1
0
2
1

Iterative approach:

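def expanding(data, cols):
    # Work on a copy so the input frame is left untouched.
    copy_df = data.copy(deep=True)
    for i in range(len(copy_df)):
        if i == 0:
            # With a single row there is only one group, so its rank is 0.
            copy_df.loc[i, cols] = 0
        else:
            # Rank the groups using only rows 0..i.
            op = group_processor(data[:i+1], cols, i)
            copy_df.loc[i, cols] = op
    return copy_df

def group_processor(cut_df, cols, i):
    op = []
    for each_col in cols:
        # Mean of c per group, dense-ranked and shifted to start at 0.
        temp = cut_df.pivot_table("c", [each_col]).rank(method="dense") - 1
        value = cut_df.loc[i, each_col]
        temp = temp.reset_index()
        # Rank of the group that row i belongs to.
        final_value = temp.loc[temp[each_col] == value, "c"]
        op.append(final_value.values[0])
    return op

expanding(df, ["d1"])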

I am able to do this iteratively through every row of the DF, but the performance is poor for large DFs, so any suggestions on a more pandas-based approach would be great.

Use Series.expanding with a minimum window size of 1 on column c, and apply a custom lambda function exp. In this lambda function we use Series.groupby to group the expanding window w by the column d1 of the original dataframe and transform using mean. Finally, using Series.rank with method='dense', we calculate the rank and keep the value for the last row of the window:

exp = lambda w: w.groupby(df['d1']).transform('mean').rank(method='dense').iat[-1]
df['d1_new'] = df['c'].expanding(1).apply(exp).sub(1).astype(int)

Result:

# print(df)

   a   b   c     d1      d2        date  d1_new
0  7   1  19  Apple  Orange  2002-01-01       0
1  3   7  17  Mango   lemon  2002-01-01       0
2  9   6   4  Apple   lemon  2002-01-01       0
3  0   5  51   Pine  Orange  2002-01-01       2
4  4   6   8  Apple   lemon  2002-02-01       0
5  4   3   1  Mango  Orange  2002-02-01       0
6  2   2  14  Apple   lemon  2002-02-01       1
7  5  15  10  Mango  Orange  2002-01-01       0
8  1   2  10   Pine   lemon  2002-02-01       2
9  2   1  12  Apple  Orange  2002-02-01       1
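To see what the lambda does on a single window, here is a minimal sketch for the window ending at index 3. It assumes the c and d1 values from the table, hard-coded as standalone Series; the names w, group_means, ranks and last_rank are only illustrative.

```python
import pandas as pd

# c and d1 as in the table above (all 10 rows).
c = pd.Series([19, 17, 4, 51, 8, 1, 14, 10, 10, 12], name="c")
d1 = pd.Series(["Apple", "Mango", "Apple", "Pine", "Apple",
                "Mango", "Apple", "Mango", "Pine", "Apple"], name="d1")

# The expanding window for index 3 holds the first 4 values of c.
w = c.iloc[:4]

# Grouping by the full d1 Series works because pandas aligns it on the index.
group_means = w.groupby(d1).transform("mean")  # each row gets its group's mean
ranks = group_means.rank(method="dense")       # dense rank, starting at 1

# The lambda keeps only the last row's rank; .sub(1) then shifts it to start at 0.
last_rank = int(ranks.iat[-1]) - 1

print(group_means.tolist())  # [11.5, 17.0, 11.5, 51.0]
print(ranks.tolist())        # [1.0, 2.0, 1.0, 3.0]
print(last_rank)             # 2
```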

Performance:

df.shape
(1000, 7)

%%timeit
exp = lambda w: w.groupby(df['d1']).transform('mean').rank(method='dense').iat[-1]
df['d1_new'] = df['c'].expanding(1).apply(exp).sub(1).astype(int)
3.15 s ± 305 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
expanding(df,["d1"]) # your method
11.9 s ± 449 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
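Both timings above still run one groupby per row. One possible way to avoid that, sketched here as an untimed alternative and not as part of the answer above, is to keep cumulative sums and counts per group in a one-hot layout and rank the running means row by row. The function name expanding_group_rank is hypothetical, and the sketch assumes c has no missing values and that keys and values share the same index.

```python
import numpy as np
import pandas as pd

def expanding_group_rank(keys: pd.Series, values: pd.Series) -> pd.Series:
    """Rank (from 0) of each row's group by expanding group mean of values."""
    # One column per group; 1.0 where the row belongs to that group.
    onehot = pd.get_dummies(keys).astype(float)
    # Running sum and running count of values per group.
    sums = onehot.mul(values, axis=0).cumsum()
    counts = onehot.cumsum()
    # Running mean per group; NaN while a group has not been seen yet.
    means = sums / counts.where(counts > 0)
    # Dense rank across groups within each row (NaN groups are skipped).
    ranks = means.rank(axis=1, method="dense") - 1
    # Pick, for every row, the rank of the group that row belongs to.
    codes = onehot.columns.get_indexer(keys)
    picked = ranks.to_numpy()[np.arange(len(keys)), codes]
    return pd.Series(picked.astype(int), index=keys.index)

# Usage with the d1 and c values from the table in the question.
d1 = pd.Series(["Apple", "Mango", "Apple", "Pine", "Apple",
                "Mango", "Apple", "Mango", "Pine", "Apple"])
c = pd.Series([19, 17, 4, 51, 8, 1, 14, 10, 10, 12])
result = expanding_group_rank(d1, c)
print(result.tolist())  # [0, 0, 0, 2, 0, 0, 1, 0, 2, 1]
```

This trades memory for speed: the intermediate frames have one column per distinct value of d1, so it is only reasonable when the number of groups is small relative to the number of rows.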
