How to split a whole dataset into 4 ranges based on one column using Python
I have a dataset of 7k records [telecom dataset].
I want to split that dataset into 4 ranges based on one particular column ["tenure column"], which contains numbers from 1 to 72.
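I need to split the whole data based on this tenure column, for example:
1 to 18 range [dataset 1], 19 to 36 range [dataset 2], 37 to 54 range [dataset 3], 55 to 72 range [dataset 4]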
My sample dataset, shown with head(5):
out.head(5)
Out[51]:
customerID Date gender age region SeniorCitizen Partner \
0 9796-BPKIW 1/2/2008 1 57 1 1 0
1 4298-OYIFC 1/4/2008 1 50 2 0 1
2 9606-PBKBQ 1/6/2008 1 85 0 1 1
3 1704-NRWYE 1/9/2008 0 55 0 1 0
4 9758-MFWGD 1/6/2008 0 52 1 1 1
Dependents tenure PhoneService ... DeviceProtection TechSupport \
0 0 8 1 ... 0 0
1 0 15 1 ... 1 1
2 0 32 1 ... 0 0
3 0 9 1 ... 0 0
4 1 48 0 ... 0 0
StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod \
0 0 0 0 1 1
1 1 1 0 1 2
2 0 1 0 1 2
3 1 0 0 1 2
4 0 0 1 0 0
MonthlyCharges TotalCharges Churn
0 69.95 562.70 0
1 103.45 1539.80 0
2 85.00 2642.05 1
3 80.85 751.65 1
4 29.90 1388.75 0
You can do this easily with pandas.
import pandas as pd
df = pd.read_csv('your_dataset_file.csv', sep=',', header=0)
# Sort it according to tenure
df.sort_values(by=['tenure'], inplace=True)
# Create bin edges
step_size = int(df.tenure.max()/4)
bin_edges = list(range(0,df.tenure.max()+step_size, step_size))
lbls = ['a','b','c','d']
df['bin'] = pd.cut(df.tenure,bin_edges, labels= lbls)
# Create separate dataframes from it
df1 = df[df.bin == 'a']
df2 = df[df.bin == 'b']
df3 = df[df.bin == 'c']
df4 = df[df.bin == 'd']
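As a variation on the above, here is a minimal sketch that writes the bin edges out explicitly (0, 18, 36, 54, 72, taken from the ranges in the question) so they do not depend on the maximum value found in the data, and then collects the pieces with groupby. The small DataFrame and the names sample, edges and parts are made up for illustration; they are not from the original dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; only the tenure column matters here.
sample = pd.DataFrame({
    "customerID": ["A", "B", "C", "D", "E", "F"],
    "tenure": [1, 18, 19, 36, 54, 72],
})

# Explicit edges: with the default right=True the intervals are
# (0, 18], (18, 36], (36, 54], (54, 72].
edges = [0, 18, 36, 54, 72]
sample["bin"] = pd.cut(sample["tenure"], bins=edges, labels=[1, 2, 3, 4])

# One DataFrame per range, keyed by the labels 1..4.
parts = {
    label: group.drop(columns="bin")
    for label, group in sample.groupby("bin", observed=True)
}

for label, part in parts.items():
    print(label, part["tenure"].tolist())
```

With observed=True, a range that has no rows simply does not appear in parts, so it may be worth checking for a key before using it.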
I would create a list of datasets:
# "tenure" is the column name shown in the sample output
dflist = [df[df["tenure"].isin(range(i*18 + 1, (i+1)*18 + 1))] for i in range(4)]
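Because the four ranges all have the same width of 18, the range number can also be computed with integer division instead of isin. This is only a sketch under that equal-width assumption; sample and group_index are illustrative names, and the tenure values are invented.

```python
import pandas as pd

# Hypothetical data covering the boundaries of each range.
sample = pd.DataFrame({"tenure": [1, 18, 19, 36, 37, 54, 55, 72]})

# (tenure - 1) // 18 maps 1-18 -> 0, 19-36 -> 1, 37-54 -> 2, 55-72 -> 3.
group_index = (sample["tenure"] - 1) // 18

# One DataFrame per range, in order.
dflist = [sample[group_index == i] for i in range(4)]

for i, part in enumerate(dflist, start=1):
    print(i, part["tenure"].tolist())
```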
Easy-to-understand code:
i = 1
m = 0
# "tenure" is the column name shown in the sample output
out["tenure"] = out["tenure"].astype(int)
df = [None]*4
while i < 72:
    # keep rows with i <= tenure <= i+17 (1-18 on the first pass)
    df[m] = out[(out["tenure"] >= i) & (out["tenure"] <= (i+17))]
    m += 1
    i += 18
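The same loop can be sketched with Series.between, which is inclusive on both ends by default, together with a quick check that every row landed in one of the four pieces. The frame out_demo and the list pieces are hypothetical; the first five tenure values are the ones from the sample output, and the last two are invented to cover the extremes.

```python
import pandas as pd

# Hypothetical frame: 8, 15, 32, 9, 48 come from the sample output; 72 and 1 are invented.
out_demo = pd.DataFrame({"tenure": [8, 15, 32, 9, 48, 72, 1]})

pieces = []
for start in range(1, 72, 18):  # start takes the values 1, 19, 37, 55
    # between() keeps rows with start <= tenure <= start + 17
    pieces.append(out_demo[out_demo["tenure"].between(start, start + 17)])

# Sanity check: the four pieces together account for every row.
total_rows = sum(len(p) for p in pieces)
print(total_rows == len(out_demo))
```

If the check fails on real data, some tenure values lie outside 1 to 72 (for example 0 or missing values) and need to be handled separately.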
Hope this solves your problem.