Ordering value in DataFrame based on pre-defined percentage groups
I have the following sample dataframe:
df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18],
'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
I add the percentile rank of the Value column to it using:
df['Percent Value'] = df.reset_index()['Value'].rank(method='dense', pct=True)
This gives:
ID Value Percent Value
0 11 1 0.1
1 12 2 0.2
2 16 3 0.3
3 19 4 0.4
4 14 5 0.5
5 9 6 0.6
6 4 7 0.7
7 13 8 0.8
8 6 9 0.9
9 18 10 1.0
Then I define the following percentage array:
percentage = np.array([30, 40, 60, 90, 100])
The output I want is an Order column where, according to the percentage array, values up to 30% get order 1, 30-40% get order 2, 40-60% get order 3, 60-90% get order 4, and 90-100% get order 5.
So the final output is:
ID Value Percent Value Order
0 11 1 0.1 1
1 12 2 0.2 1
2 16 3 0.3 1
3 19 4 0.4 2
4 14 5 0.5 3
5 9 6 0.6 3
6 4 7 0.7 4
7 13 8 0.8 4
8 6 9 0.9 4
9 18 10 1.0 5
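I can do this by iterating over the Percent Value column and, for each value, taking the index of the first entry in the percentage array for which the <= condition returns True. I would like to know whether there is a simpler way in pandas where I can pass the percentage array as an argument for comparison against the Percent Value column.
I tried using a list comprehension: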
# Percent Value is a fraction (0.1 ... 1.0), so compare against percentage / 100
df['Order'] = [1 + np.where(v <= percentage / 100)[0][0] for v in df['Percent Value']]
pandas cut can help here; I included a 0 at the start of the bins to get the range up to 0.3:
df['Percent Value'] = df['Value'].rank(method='dense', pct=True)
percentage = np.array([30, 40, 60, 90, 100])
# convert to fractions, since Percent Value is stored as a fraction
percentage = percentage / 100
# insert a 0 at the start as the lowest bin edge, so the bins are
# (0, 0.3], (0.3, 0.4], (0.4, 0.6], and so on
# each value is labelled according to the bin it falls in
df['Order'] = pd.cut(df['Percent Value'], bins=np.insert(percentage, 0, 0), labels=[1, 2, 3, 4, 5])
ID Value Percent Value Order
0 11 1 0.1 1
1 12 2 0.2 1
2 16 3 0.3 1
3 19 4 0.4 2
4 14 5 0.5 3
5 9 6 0.6 3
6 4 7 0.7 4
7 13 8 0.8 4
8 6 9 0.9 4
9 18 10 1.0 5
Another approach:
>>> df
ID Value Percent Value
0 11 1 0.1
1 12 2 0.2
2 16 3 0.3
3 19 4 0.4
4 14 5 0.5
5 9 6 0.6
6 4 7 0.7
7 13 8 0.8
8 6 9 0.9
9 18 10 1.0
>>>
>>> percentage = np.array([30, 40, 60, 90, 100])
>>> pd.cut(df['Percent Value'], bins=[-np.inf, *percentage/100], labels=range(1, len(percentage) + 1))
0 1
1 1
2 1
3 2
4 3
5 3
6 4
7 4
8 4
9 5
Name: Percent Value, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
pandas 1.0.3
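The two pd.cut answers above differ only in the lowest bin edge (0 versus -np.inf). A minimal sketch of where that matters, using a hypothetical sample Series that is not part of the original question: with the default right-closed bins, a value of exactly 0 is not in (0, 0.3] and becomes NaN, whereas -np.inf or include_lowest=True keeps it in the first bin. A dense percentile rank is never 0, so both answers agree on the data in the question.

```python
import numpy as np
import pandas as pd

edges = np.array([30, 40, 60, 90, 100]) / 100
labels = list(range(1, len(edges) + 1))

# Hypothetical values chosen to probe the bin boundaries.
sample = pd.Series([0.0, 0.1, 0.3, 0.30001, 1.0])

# Lowest edge 0: bins are (0, 0.3], (0.3, 0.4], ... so 0.0 is not covered.
with_zero = pd.cut(sample, bins=np.insert(edges, 0, 0), labels=labels)

# Lowest edge -inf: the first bin is (-inf, 0.3], so 0.0 is covered.
with_inf = pd.cut(sample, bins=[-np.inf, *edges], labels=labels)

# Lowest edge 0 with include_lowest=True: the first bin also contains 0.0.
with_lowest = pd.cut(sample, bins=np.insert(edges, 0, 0), labels=labels,
                     include_lowest=True)

print(pd.DataFrame({'value': sample, 'zero': with_zero,
                    'neg_inf': with_inf, 'include_lowest': with_lowest}))
```

Which variant to prefer depends on whether a value at or below 0 should be treated as missing or placed in the first group.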
You can use np.select:
In [1695]: import numpy as np
In [1702]: conditions = [df['Percent Value'].le(0.30),\
...: (df['Percent Value'].gt(0.30) & df['Percent Value'].le(0.40)),\
...: (df['Percent Value'].gt(0.40) & df['Percent Value'].le(0.60)),\
...: (df['Percent Value'].gt(0.60)& df['Percent Value'].le(0.90)),\
...: (df['Percent Value'].gt(0.90) & df['Percent Value'].le(1))]
In [1694]: choices = [1,2,3,4,5]
In [1697]: df['Order'] = np.select(conditions, choices)
In [1698]: df
Out[1698]:
ID Value Percent Value Order
0 11 1 0.10 1
1 12 2 0.20 1
2 16 3 0.30 1
3 19 4 0.40 2
4 14 5 0.50 3
5 9 6 0.60 3
6 4 7 0.70 4
7 13 8 0.80 4
8 6 9 0.90 4
9 18 10 1.00 5
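The conditions above follow a regular pattern (greater than the previous bound, less than or equal to the current one), so they could also be generated from the percentage array rather than typed out. The following is only a sketch under that assumption; the names lower, upper and pv are introduced here for illustration and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18],
                   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df['Percent Value'] = df['Value'].rank(method='dense', pct=True)

percentage = np.array([30, 40, 60, 90, 100])
upper = percentage / 100                    # upper bound of each group
lower = np.insert(upper[:-1], 0, -np.inf)   # lower bound of each group

pv = df['Percent Value']
# One boolean Series per group: lower < value <= upper
conditions = [pv.gt(lo) & pv.le(hi) for lo, hi in zip(lower, upper)]
choices = list(range(1, len(upper) + 1))

# Rows matching no condition (values above the last bound) get the default 0.
df['Order'] = np.select(conditions, choices, default=0)
print(df)
```

This keeps the np.select call itself unchanged while letting the groups follow whatever is in the percentage array.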
Just a performance comparison of all the answers:
@sammywemmy's answer:
In [1723]: %timeit pd.cut(df['Percent Value'],bins=np.insert(percentage,0,0), labels = [1,2,3,4,5])
1.21 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@timgeb's answer:
In [1725]: %timeit pd.cut(df['Percent Value'], bins=[-np.inf, *percentage/100], labels=range(1, len(percentage) + 1))
1.02 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
My answer:
In [1726]: %timeit np.select(conditions, choices)
86 µs ± 3.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
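I know that writing out the conditions explicitly is tedious, but numpy is considerably faster here. Please check the timings shared above for each answer. Note that they were measured on the 10-row sample frame, and that the np.select timing does not include building the conditions list.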
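Another NumPy option that avoids writing the conditions by hand is np.searchsorted, which is not used in any of the answers above and is offered here only as a sketch. With side='left', each value is mapped to the index of the first threshold that is greater than or equal to it, which matches the "up to and including the bound" grouping in the question. No timing is claimed for it; it would need to be measured on realistic data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18],
                   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df['Percent Value'] = df['Value'].rank(method='dense', pct=True)

# Thresholds must be sorted in ascending order for searchsorted.
thresholds = np.array([30, 40, 60, 90, 100]) / 100

# Index of the first threshold >= value, shifted so that orders start at 1.
df['Order'] = np.searchsorted(thresholds,
                              df['Percent Value'].to_numpy(),
                              side='left') + 1
print(df)
```

A value above the last threshold would get an order of len(thresholds) + 1, so this assumes the final threshold covers the maximum, as 100 does here.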
Using np.where as a CASE statement:
# Import packages:
import pandas as pd
import numpy as np
# Create data: df => DataFrame
df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18],
                   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
# Dense percentage rank of the Value column: Percent Value => float Series
df['Percent Value'] = df['Value'].rank(method='dense', pct=True)
# Conditionally define the Order column: Order => integer Series
# between() includes both endpoints by default. Adjacent ranges share an edge,
# and the first matching branch wins, so 0.3 gets order 1, 0.6 gets order 3,
# and no value can fall into a gap between ranges.
df['Order'] = np.where(
    df['Percent Value'].between(.0, .30),
    1,
    np.where(
        df['Percent Value'].between(.30, .40), 2,
        np.where(
            df['Percent Value'].between(.40, .60), 3,
            np.where(
                df['Percent Value'].between(.60, .90), 4, 5
            )
        )
    )
)
# Display the DataFrame
print(df)