
Ordering value in DataFrame based on pre-defined percentage groups

I have the following sample DataFrame

df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18], 
                   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

I add the percentage rank of Value to it using

df['Percent Value'] = df.reset_index()['Value'].rank(method='dense', pct=True)

which gives

    ID  Value   Percent Value
0   11   1         0.1
1   12   2         0.2
2   16   3         0.3
3   19   4         0.4
4   14   5         0.5
5   9    6         0.6
6   4    7         0.7
7   13   8         0.8
8   6    9         0.9
9   18   10        1.0

I then define the following percentage array

percentage = np.array([30, 40, 60, 90, 100])

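The output I want is an Order column where, according to the percentage array, a Value up to 30% gets order 1, 30-40% gets order 2, 40-60% gets order 3, 60-90% gets order 4, and 90-100% gets order 5. A value that sits exactly on a boundary belongs to the lower group.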

So the final output is

    ID  Value   Percent Value   Order
0   11   1          0.1           1
1   12   2          0.2           1
2   16   3          0.3           1
3   19   4          0.4           2
4   14   5          0.5           3
5   9    6          0.6           3
6   4    7          0.7           4
7   13   8          0.8           4
8   6    9          0.9           4
9   18   10         1.0           5

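I can do this by looping over the Percent Value column and taking the index of the first entry in the percentage array for which the comparison (value <= entry) is True. I would like to know whether pandas has a simpler way, where I can pass the percentage array as a parameter to compare against the Percent Value column.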

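I tried using a list comprehension

# percentage holds whole percents, so scale it to match 'Percent Value'
df['Order'] = [1 + np.where(v <= percentage / 100)[0][0] for v in df['Percent Value']]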
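
The list comprehension above looks up, for each value, the first bin edge that is greater than or equal to it. A vectorized sketch of the same lookup, assuming the edges are sorted in ascending order, could use np.searchsorted. The name edges is illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18],
                   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df['Percent Value'] = df['Value'].rank(method='dense', pct=True)

percentage = np.array([30, 40, 60, 90, 100])
edges = percentage / 100  # illustrative name; same scale as 'Percent Value'

# side='left' returns the index of the first edge >= value,
# so a value exactly on an edge (e.g. 0.3) stays in the lower group.
positions = np.searchsorted(edges, df['Percent Value'].to_numpy(), side='left')
df['Order'] = positions + 1  # groups are numbered from 1

print(df)
```

A value above the last edge would get len(edges) + 1 here instead of raising an IndexError as the list comprehension would, so that case may need its own check.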

pandas cut can help here; I included a 0 at the start to get the range up to 0.3

df['Percent Value'] = df.Value.rank(method='dense',pct=True)

percentage = np.array([30, 40, 60, 90, 100])


# convert to fractions, since Percent Value is in that format
percentage = percentage / 100


# insert a 0 at the start to get the lower boundary,
# so the bins are (0, 0.3], (0.3, 0.4], (0.4, 0.6], and so on;
# each value gets the label of the bin it falls into
df['Order'] = pd.cut(df['Percent Value'], bins=np.insert(percentage, 0, 0), labels=[1, 2, 3, 4, 5])


    ID  Value   Percent Value   Order
0   11    1      0.1            1
1   12    2      0.2            1
2   16    3      0.3            1
3   19    4      0.4            2
4   14    5      0.5            3
5   9     6      0.6            3
6   4     7      0.7            4
7   13    8      0.8            4
8   6     9      0.9            4
9   18    10     1.0            5
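
pd.cut builds right-closed intervals by default, which is why 0.3 lands in group 1. The sketch below prints those intervals and also shows one edge case. The sample values are hypothetical, chosen to sit on the bin edges: a value of exactly 0 falls outside (0, 0.3] unless include_lowest=True is passed.

```python
import numpy as np
import pandas as pd

edges = np.insert(np.array([30, 40, 60, 90, 100]) / 100, 0, 0)
sample = pd.Series([0.0, 0.3, 0.31, 1.0])  # hypothetical values on and near the edges

# Without labels, pd.cut reports the interval each value falls into.
intervals = pd.cut(sample, bins=edges)
print(intervals)

# 0.0 is not inside (0.0, 0.3], so it comes back as NaN by default.
default_order = pd.cut(sample, bins=edges, labels=[1, 2, 3, 4, 5])
print(default_order)

# include_lowest=True makes the first interval include its left edge.
closed_order = pd.cut(sample, bins=edges, labels=[1, 2, 3, 4, 5],
                      include_lowest=True)
print(closed_order)
```

With the dense percentage rank used in the question the smallest value is always above 0, so the default behaviour is enough there.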

Another approach.

>>> df
   ID  Value  Percent Value
0  11      1            0.1
1  12      2            0.2
2  16      3            0.3
3  19      4            0.4
4  14      5            0.5
5   9      6            0.6
6   4      7            0.7
7  13      8            0.8
8   6      9            0.9
9  18     10            1.0
>>>
>>> percentage = np.array([30, 40, 60, 90, 100])
>>> pd.cut(df['Percent Value'], bins=[-np.inf, *percentage/100], labels=range(1, len(percentage) + 1))
0    1
1    1
2    1
3    2
4    3
5    3
6    4
7    4
8    4
9    5
Name: Percent Value, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

pandas 1.0.3
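
The call above returns a categorical Series. If a plain integer column is preferred, one possible sketch wraps the same bins in a small helper and passes labels=False, which makes pd.cut return 0-based bin positions. The helper name order_from_percentages is hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

def order_from_percentages(values, percentage):
    """Hypothetical helper: map fractional ranks to 1-based group numbers."""
    bins = [-np.inf, *(np.asarray(percentage) / 100)]
    # labels=False returns the 0-based position of the bin for each value.
    return pd.cut(values, bins=bins, labels=False) + 1

df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18],
                   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df['Percent Value'] = df['Value'].rank(method='dense', pct=True)

df['Order'] = order_from_percentages(df['Percent Value'], [30, 40, 60, 90, 100])
print(df)
```

Values above the last edge would come back as NaN, which also turns the result into a float column, so this assumes the last percentage is 100.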

You can use np.select

In [1695]: import numpy as np

In [1702]: conditions = [df['Percent Value'].le(0.30),
  ...: (df['Percent Value'].gt(0.30) & df['Percent Value'].le(0.40)),
  ...: (df['Percent Value'].gt(0.40) & df['Percent Value'].le(0.60)),
  ...: (df['Percent Value'].gt(0.60) & df['Percent Value'].le(0.90)),
  ...: (df['Percent Value'].gt(0.90) & df['Percent Value'].le(1))]

In [1694]: choices = [1,2,3,4,5]

In [1697]: df['Order'] = np.select(conditions, choices)

In [1698]: df 
Out[1698]:
   ID  Value  Percent Value  Order
0  11      1           0.10      1
1  12      2           0.20      1
2  16      3           0.30      1
3  19      4           0.40      2
4  14      5           0.50      3
5   9      6           0.60      3
6   4      7           0.70      4
7  13      8           0.80      4
8   6      9           0.90      4
9  18     10           1.00      5
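
The conditions do not have to be typed out by hand. A minimal sketch, assuming the percentage array is sorted in ascending order, builds one mask per group from the array itself; the names edges, lower and pv are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18],
                   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df['Percent Value'] = df['Value'].rank(method='dense', pct=True)
pv = df['Percent Value'].to_numpy()

edges = np.array([30, 40, 60, 90, 100]) / 100
lower = np.concatenate(([-np.inf], edges[:-1]))  # lower bound of each group

# One boolean mask per group: lower < value <= upper
conditions = [(pv > lo) & (pv <= hi) for lo, hi in zip(lower, edges)]
choices = list(range(1, len(edges) + 1))

# default=0 marks values above the last edge (none in this data)
df['Order'] = np.select(conditions, choices, default=0)
print(df)
```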

Just a performance comparison of all the answers:

@sammywemmy's answer:

In [1723]: %timeit pd.cut(df['Percent Value'],bins=np.insert(percentage,0,0), labels = [1,2,3,4,5])
1.21 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@timgeb's answer:

In [1725]: %timeit pd.cut(df['Percent Value'], bins=[-np.inf, *percentage/100], labels=range(1, len(percentage) + 1)) 
1.02 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

My answer:

In [1726]: %timeit np.select(conditions, choices) 
86 µs ± 3.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I know it is tedious to write out the conditions explicitly, but numpy performs very well. See the timings above for each answer.
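
Note that the np.select timing above covers only the final call, with the conditions already built. A rough sketch for timing both approaches end to end outside IPython, using the standard timeit module, might look like this. The function names are illustrative, and the numbers depend on the machine and library versions, so no results are shown.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18],
                   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df['Percent Value'] = df['Value'].rank(method='dense', pct=True)
pv = df['Percent Value']
edges = np.array([30, 40, 60, 90, 100]) / 100

def with_cut():
    return pd.cut(pv, bins=[-np.inf, *edges], labels=range(1, len(edges) + 1))

def with_select():
    # The conditions are rebuilt on every call so their cost is included.
    conditions = [pv.le(0.30),
                  pv.gt(0.30) & pv.le(0.40),
                  pv.gt(0.40) & pv.le(0.60),
                  pv.gt(0.60) & pv.le(0.90),
                  pv.gt(0.90) & pv.le(1)]
    return np.select(conditions, [1, 2, 3, 4, 5])

runs = 200
timings = {}
for name, fn in [('pd.cut', with_cut), ('np.select', with_select)]:
    timings[name] = timeit.timeit(fn, number=runs) / runs
    print(f'{name}: {timings[name] * 1e6:.1f} microseconds per call')
```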

Using np.where as a CASE statement:

# Initialise packages in session: 
import pandas as pd
import numpy as np

# Create data: df => DataFrame
df = pd.DataFrame({'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18], 
                   'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Dense-rank the Value column as a fraction: Percent Value => float Series
df['Percent Value'] = df.reset_index()['Value'].rank(method = 'dense', pct = True)

# Conditionally define the order column: Order => integer Series
# Each inner np.where result is used only for rows that failed the outer test,
# so one upper bound per level is enough and no gaps are left between groups.
pv = df['Percent Value']
df['Order'] = np.where(
    pv <= .30, 1,
    np.where(
        pv <= .40, 2,
        np.where(
            pv <= .60, 3,
            np.where(pv <= .90, 4, 5)
        )
    )
)

# Display the DataFrame: df => stdout (console)
print(df)

