Create a column with buckets based on value ranges in another column (Python)
I have a sample df:
A | B |
---|---|
X | 30 |
Y | 150 |
Z | 450 |
XX | 300 |
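I need to create another column, C, that buckets column B according to some breakpoints:
breakpts = [50,100,250,350]
A | B | C |
---|---|---|
X | 30 | '0-50' |
Y | 150 | '100-250' |
Z | 450 | '>350' |
XX | 300 | '250-350' |
I have the following code, which works: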
def conditions(i):
    if i <= 50: return '0-50'
    if i > 50 and i <= 100: return '50-100'
    if i > 100 and i <= 250: return '100-250'
    if i > 250 and i <= 350: return '250-350'
    if i > 350: return '>350'

df['C'] = df['B'].apply(conditions)
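However, I want to make breakpts dynamic. If I use different breakpoints, for example [100,250,300,400], the code should automatically create different buckets based on them.
Any ideas on how to do this?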
As pointed out in the comments, pd.cut() is the way to go. You can make the breaks dynamic and set them yourself:
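import pandas as pd
import numpy as np

# Bin edges; pd.cut uses right-closed intervals by default: (0, 50], (50, 100], ...
bins = [0, 50, 100, 250, 350, np.inf]
# One label per interval, so there is one fewer label than there are edges
labels = ["'0-50'", "'50-100'", "'100-250'", "'250-350'", "'>350'"]
df['C'] = pd.cut(df['B'], bins=bins, labels=labels)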
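The bins and labels above are still written out by hand. A minimal sketch of deriving both from the breakpoint list follows. It assumes the breakpoints are sorted and positive and that the first bucket starts at 0. The helper name make_bins_and_labels is hypothetical, and the labels omit the embedded quote characters used above.

```python
import numpy as np
import pandas as pd


def make_bins_and_labels(breakpts):
    """Build pd.cut bins and labels from a sorted list of positive breakpoints."""
    # Edges run from 0 up to infinity, so values above the last breakpoint get a bucket too.
    bins = [0] + list(breakpts) + [np.inf]
    # One label per adjacent pair of finite edges, e.g. '0-50', '50-100', ...
    labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-2], bins[1:-1])]
    # The last bucket is open-ended.
    labels.append(f">{breakpts[-1]}")
    return bins, labels


df = pd.DataFrame({"A": ["X", "Y", "Z", "XX"], "B": [30, 150, 450, 300]})

breakpts = [50, 100, 250, 350]
bins, labels = make_bins_and_labels(breakpts)
# include_lowest=True closes the first interval on the left, so B == 0 lands in '0-50'.
df["C"] = pd.cut(df["B"], bins=bins, labels=labels, include_lowest=True)
print(df)
```

Passing breakpts = [100, 250, 300, 400] instead should give the labels '0-100', '100-250', '250-300', '300-400' and '>400'. Unlike the original conditions function, negative values fall outside every bin here and come back as NaN.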
Also take a look at pandas.qcut, which is a quantile-based discretization function.
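For comparison, here is a small sketch of pd.qcut on the same sample data. The column name C_q and the labels "low" and "high" are made up for illustration. The bucket edges come from the quantiles of the data, not from a breakpoint list, so this only fits when equal-sized buckets are what you want.

```python
import pandas as pd

df = pd.DataFrame({"A": ["X", "Y", "Z", "XX"], "B": [30, 150, 450, 300]})

# q=2 splits column B at its median, giving two buckets of equal size.
df["C_q"] = pd.qcut(df["B"], q=2, labels=["low", "high"])
print(df)
```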
Alternatively, use np.select:
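col = 'B'
# between() includes both endpoints by default (inclusive="both"), so adjacent
# ranges overlap at the breakpoints; np.select takes the first matching condition.
conditions = [
    df[col].between(0, 50),
    df[col].between(50, 100),
    df[col].between(100, 250),
    df[col].between(250, 350),
    df[col].ge(350)
]
choices = ["'0-50'", "'50-100'", "'100-250'", "'250-350'", "'>350'"]
# default=None rather than np.nan: newer NumPy versions refuse to mix string
# choices with a float default.
df["C"] = np.select(conditions, choices, default=None)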
Both print:
A B C
0 X 30 '0-50'
1 Y 150 '100-250'
2 Z 450 '>350'
3 XX 300 '250-350'
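The np.select version can also be driven by the breakpoint list. The following is a rough sketch, not part of the original answer. It assumes sorted, positive breakpoints and a first bucket starting at 0. The string default "other" is an arbitrary placeholder for values that match no condition, such as negatives.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["X", "Y", "Z", "XX"], "B": [30, 150, 450, 300]})
breakpts = [100, 250, 300, 400]  # the alternative breakpoints from the question

# Pair up consecutive edges: (0, 100), (100, 250), (250, 300), (300, 400).
edges = [0] + breakpts
pairs = list(zip(edges[:-1], edges[1:]))

# between() is inclusive on both ends; np.select keeps the first match,
# so a value sitting exactly on a breakpoint goes to the lower bucket.
conditions = [df["B"].between(lo, hi) for lo, hi in pairs]
conditions.append(df["B"].gt(breakpts[-1]))

choices = [f"{lo}-{hi}" for lo, hi in pairs] + [f">{breakpts[-1]}"]

df["C"] = np.select(conditions, choices, default="other")
print(df)
```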