Replace pandas column with sorted index
I have a sample DF and am trying to replace the values in a list of columns with an ascending sorted index:
DF:
df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
   a  b  c     d1      d2       date
0  2  7  9  Apple  Orange 2002-01-01
1  6  0  9  Mango   lemon 2002-01-01
2  8  0  0  Apple   lemon 2002-01-01
3  4  4  4  Mango  Orange 2002-01-01
4  5  0  8  Mango   lemon 2002-02-01
5  6  1  6  Mango  Orange 2002-02-01
6  7  2  7  Apple   lemon 2002-02-01
Step 1:
Group the DF by the "date" column. The sample group for "2002-01-01" is:
   a  b  c     d1      d2       date
0  2  7  9  Apple  Orange 2002-01-01
1  6  0  9  Mango   lemon 2002-01-01
2  8  0  0  Apple   lemon 2002-01-01
3  4  4  4  Mango  Orange 2002-01-01
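Step 2:
Within each group, replace the values of the columns ["d1","d2"] with an index (not the DF index) based on the sorted mean of c.
For example, in the group above, mean(c, d1="Apple") = (9+0)/2 => 4.5 and mean(c, d1="Mango") = (9+4)/2 => 6.5, so the ascending sorted index is Apple: 0 and Mango: 1.
The values of column d1 are therefore replaced as follows: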
   a  b  c  d1      d2       date
0  2  7  9   0  Orange 2002-01-01
1  6  0  9   1   lemon 2002-01-01
2  8  0  0   0   lemon 2002-01-01
3  4  4  4   1  Orange 2002-01-01
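The arithmetic for this one group can be checked with a short sketch. It assumes the c and d1 values printed in the table above (the original frame is random, so these are hard-coded here), and the names group, means and codes are illustrative only.

```python
import pandas as pd

# The "2002-01-01" group, using the c and d1 values printed in the question.
group = pd.DataFrame({
    "c": [9, 9, 0, 4],
    "d1": ["Apple", "Mango", "Apple", "Mango"],
})

# Mean of c for each d1 label within this group.
means = group.groupby("d1")["c"].mean()

# Dense ascending rank of the means, shifted so the smallest mean gets 0.
codes = (means.rank(method="dense") - 1).astype(int)

# Replace each label with its rank.
group["d1"] = group["d1"].map(codes)

print(means.to_dict())       # {'Apple': 4.5, 'Mango': 6.5}
print(group["d1"].tolist())  # [0, 1, 0, 1]
```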
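Apply this to the whole df. I have a brute-force approach that iterates over the groups and over every row; any suggestions for a more pandas-based solution would help improve efficiency.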
Is this what you are looking for in column d1? You can apply a similar technique to d2 as well. It is not the most elegant solution, though.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
# Mean of c for each (date, d1) pair, broadcast back to every row
df['mean_value'] = df.groupby(['date', 'd1'])['c'].transform('mean')
# Dense ascending rank of those means within each date, shifted to start at 0
df['rank_value'] = (df.groupby(['date'])['mean_value'].rank(ascending=True, method='dense') - 1).astype(int)
df['d1'] = df['rank_value']
# Drop the helper columns
df = df.drop(columns=['rank_value', 'mean_value'])
df
   a  b  c  d1      d2       date
0  3  1  4   1  Orange 2002-01-01
1  9  7  5   0   lemon 2002-01-01
2  9  9  5   1   lemon 2002-01-01
3  8  1  2   0  Orange 2002-01-01
4  8  0  1   0   lemon 2002-02-01
5  1  8  3   0  Orange 2002-02-01
6  8  0  4   1   lemon 2002-02-01
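To extend the same idea to d2, one option is to wrap the transform and rank steps in a loop over the columns. The following is a minimal sketch, not part of the answer above: rank_encode is a hypothetical helper name, and the frame is rebuilt from the values printed in the question instead of random numbers so that the result is reproducible.

```python
import pandas as pd

# Frame rebuilt from the values printed in the question (assumption: fixed data
# instead of np.random, so the result is reproducible).
df = pd.DataFrame({
    "a": [2, 6, 8, 4, 5, 6, 7],
    "b": [7, 0, 0, 4, 0, 1, 2],
    "c": [9, 9, 0, 4, 8, 6, 7],
    "d1": ["Apple", "Mango", "Apple", "Mango", "Mango", "Mango", "Apple"],
    "d2": ["Orange", "lemon", "lemon", "Orange", "lemon", "Orange", "lemon"],
    "date": pd.to_datetime(["2002-01-01"] * 4 + ["2002-02-01"] * 3),
})


def rank_encode(frame, cols, by="date", value="c"):
    """Hypothetical helper: replace each column in cols with the 0-based dense
    rank of its per-group mean of `value`, computed within each `by` group."""
    out = frame.copy()
    for col in cols:
        # Mean of `value` per (by, col) pair, broadcast back to every row.
        means = frame.groupby([by, col])[value].transform("mean")
        # Dense ascending rank of those means within each `by` group.
        ranks = means.groupby(frame[by]).rank(method="dense", ascending=True)
        out[col] = (ranks - 1).astype(int)
    return out


result = rank_encode(df, ["d1", "d2"])
print(result["d1"].tolist())  # [0, 1, 0, 1, 0, 0, 0]
print(result["d2"].tolist())  # [1, 0, 0, 1, 1, 0, 1]
```

The helper reads labels from the original frame and writes into a copy, so encoding d1 does not affect how d2 is grouped.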
You can use pivot_table and groupby.rank to create the ranks, then use map to reassign the values.
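# Mean of c per (date, d1), then 0-based dense rank within each date
df1 = df.pivot_table('c', ['date','d1']).groupby(level=0).rank(method='dense')-1
# Build (date, d1) tuples per row and look up their rank in df1
df['d1'] = df[['date','d1']].agg(tuple, axis=1).map(df1.c).astype('int')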
Out[255]:
   a  b  c  d1      d2       date
0  2  7  9   0  Orange 2002-01-01
1  6  0  9   1   lemon 2002-01-01
2  8  0  0   0   lemon 2002-01-01
3  4  4  4   1  Orange 2002-01-01
4  5  0  8   0   lemon 2002-02-01
5  6  1  6   0  Orange 2002-02-01
6  7  2  7   0   lemon 2002-02-01
Note: in group 2002-02-01, Mango and Apple have the same mean value of 7, so both are ranked 0.
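The tie behaviour can be seen in isolation with a small sketch. It hard-codes the two equal means from the 2002-02-01 group and compares method='dense' with method='first'. The question does not say whether tied labels must get distinct codes, so the second variant is only an assumption about what might be wanted.

```python
import pandas as pd

# Per-label means of c in the 2002-02-01 group: Mango (8+6)/2 and Apple 7/1.
means = pd.Series({"Apple": 7.0, "Mango": 7.0})

# method="dense": tied means share the same rank.
dense = (means.rank(method="dense") - 1).astype(int)

# method="first": ties are broken by order of appearance in the Series.
first = (means.rank(method="first") - 1).astype(int)

print(dense.to_dict())  # {'Apple': 0, 'Mango': 0}
print(first.to_dict())  # {'Apple': 0, 'Mango': 1}
```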