Cluster X, Y values into Sector and Plot in pandas, pandas groupby and/or scikit
I have a data frame as shown below:
X Y Sector Plot
5 3 SE1 P2
3 3 SE1 P1
6 7 SE1 P3
1 6 SE1 P3
2 1 SE1 P1
7 3 SE1 P2
17 20 SE2 P1
23 22 SE2 P1
27 28 SE2 P3
31 25 SE2 P3
25 25 SE2 P2
31 31 SE2 P2
17 25 SE2 P4
23 31 SE2 P4
From the above data, I would like to compute the min and max values of X and Y for each Sector, Plot combination.
The expected output data frame is shown below.
Sector_Plot Xmin Xmax Ymin Ymax
SE1_P1 2 3 1 3
SE1_P2 5 7 3 3
SE1_P3 1 6 6 7
SE2_P1 17 23 20 22
SE2_P2 25 31 25 25
SE2_P3 27 31 25 31
SE2_P4 17 23 25 31
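The min/max summary described above can be sketched with a grouped aggregation. This is a minimal sketch, assuming the sample rows from the question and the column names `X`, `Y`, `Sector`, and `Plot`; the frame name `bounds` is only illustrative.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "X": [5, 3, 6, 1, 2, 7, 17, 23, 27, 31, 25, 31, 17, 23],
    "Y": [3, 3, 7, 6, 1, 3, 20, 22, 28, 25, 25, 31, 25, 31],
    "Sector": ["SE1"] * 6 + ["SE2"] * 8,
    "Plot": ["P2", "P1", "P3", "P3", "P1", "P2",
             "P1", "P1", "P3", "P3", "P2", "P2", "P4", "P4"],
})

# Build the combined key, then take min/max of X and Y per key
# using named aggregation (output column = (input column, function)).
bounds = (
    df.assign(Sector_Plot=df["Sector"] + "_" + df["Plot"])
      .groupby("Sector_Plot")
      .agg(Xmin=("X", "min"), Xmax=("X", "max"),
           Ymin=("Y", "min"), Ymax=("Y", "max"))
      .reset_index()
)
print(bounds)
```

With the sample rows, this gives Y bounds of 25 to 31 for SE2_P2 and 25 to 28 for SE2_P3, which is worth comparing against the hand-written table.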
Using the above rule, given new X, Y values, we should be able to predict Sector_Plot as shown below.
X Y Estimated_Sector_Plot
2.5 2 SE1_P1
2 1 SE1_P1
3 2 SE1_P1
5 3 SE1_P2
7 3 SE1_P2
6 3 SE1_P2
1 7 SE1_P3
4 6 SE1_P3
2 7 SE1_P3
28 25 SE2_P3
29 31 SE2_P3
18 19 SE2_P1
17 20 SE2_P1
19 22 SE2_P1
30 25 SE2_P2
25 25 SE2_P2
18 26 SE2_P4
17 31 SE2_P4
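A literal reading of the min/max rule is a box lookup: a point belongs to a Sector_Plot if it falls inside that combination's X and Y range. The sketch below is only an illustration of that reading, not a recommended solution; the bounds are copied from the expected-output table above, and the helper `match_boxes` is a hypothetical name.

```python
import pandas as pd

# Bounds copied from the expected-output table in the question
bounds = pd.DataFrame({
    "Sector_Plot": ["SE1_P1", "SE1_P2", "SE1_P3",
                    "SE2_P1", "SE2_P2", "SE2_P3", "SE2_P4"],
    "Xmin": [2, 5, 1, 17, 25, 27, 17],
    "Xmax": [3, 7, 6, 23, 31, 31, 23],
    "Ymin": [1, 3, 6, 20, 25, 25, 25],
    "Ymax": [3, 3, 7, 22, 25, 31, 31],
})

def match_boxes(x, y, bounds):
    """Return every Sector_Plot whose [min, max] box contains (x, y)."""
    inside = bounds[
        (bounds["Xmin"] <= x) & (x <= bounds["Xmax"])
        & (bounds["Ymin"] <= y) & (y <= bounds["Ymax"])
    ]
    return inside["Sector_Plot"].tolist()

print(match_boxes(2.5, 2, bounds))   # exactly one box matches
print(match_boxes(30, 25, bounds))   # two boxes overlap at this point
print(match_boxes(18, 19, bounds))   # no box contains this point
```

The last two calls show the limits of a pure box rule: boxes can overlap, and a new point can fall outside every box, so some tie-breaking or nearest-match step is still needed.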
I tried a machine learning method, but it was a flop. Can this be done by any other method?
I am sharing my code below:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def find_frequent_labels(df, var, rare_perc):
    # labels of `var` that cover more than `rare_perc` of the rows
    df = df.copy()
    tmp = df.groupby(var)['X'].count() / len(df)
    return tmp[tmp > rare_perc].index

# replace infrequent sectors with the label 'Rare'
for var in ['SECTOR']:
    frequent_ls = find_frequent_labels(train, var, 0.01)
    train[var] = np.where(train[var].isin(frequent_ls), train[var], 'Rare')
    test[var] = np.where(test[var].isin(frequent_ls), test[var], 'Rare')

def replace_with_X(train1, test1, var, target):
    # encode `var` by the rank of its mean `target` value
    ordered_labels = train1.groupby([var])[target].mean().sort_values().index
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}
    train1['Sec_X'] = train1[var].map(ordinal_label)
    test1['Sec_X'] = test1[var].map(ordinal_label)

for var in ['SECTOR']:
    replace_with_X(train, test, var, 'X')

def replace_with_Y(train1, test1, var, target):
    ordered_labels = train1.groupby([var])[target].mean().sort_values().index
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}
    train1['Sec_Y'] = train1[var].map(ordinal_label)
    test1['Sec_Y'] = test1[var].map(ordinal_label)

for var in ['SECTOR']:
    replace_with_Y(train, test, var, 'Y')

# integer id for each plot label, plus lookup tables in both directions
train['Plot_id'] = train['PLOT'].factorize()[0]
category_id_df = train[['PLOT', 'Plot_id']].drop_duplicates().sort_values('Plot_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['Plot_id', 'PLOT']].values)

model = LinearSVC(C=1.0, class_weight='balanced')
# note: the column 'Sector_code' is not created anywhere in the code shown above
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    train[['X', 'Y', 'Sector_code']], train['Plot_id'], train.index,
    test_size=0.01, random_state=0)
model.fit(X_train, y_train)
test['Plot_id'] = model.predict(test[['X', 'Y', 'Sector_code']])
Please note that I am very new to machine learning and pandas.
This type of task can be solved with vector quantization. Instead of the min and max, we need the centroids (mean X/Y coordinates) of each Sector_Plot cluster. Then we get the nearest cluster with scipy.cluster.vq.vq:
import pandas as pd
from scipy.cluster.vq import vq

df = pd.DataFrame({'X': [ 5, 3, 6, 1, 2, 7, 17, 23, 27, 31, 25, 31, 17, 23],
                   'Y': [ 3, 3, 7, 6, 1, 3, 20, 22, 28, 25, 25, 31, 25, 31],
                   'Sector': ['SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE2', 'SE2', 'SE2', 'SE2', 'SE2', 'SE2', 'SE2', 'SE2'],
                   'Plot': ['P2', 'P1', 'P3', 'P3', 'P1', 'P2', 'P1', 'P1', 'P3', 'P3', 'P2', 'P2', 'P4', 'P4']})
df1 = pd.DataFrame({'X': [2.5, 2, 3, 5, 7, 6, 1, 4, 2, 28, 29, 18, 17, 19, 30, 25, 18, 17],
                    'Y': [ 2, 1, 2, 3, 3, 3, 7, 6, 7, 25, 31, 19, 20, 22, 25, 25, 26, 31]})

# prepare given dataframe, get centroids (means)
df['Sector_Plot'] = df.Sector + '_' + df.Plot
df = df.drop(columns=['Sector', 'Plot'])  # keyword form; the positional axis argument was removed in pandas 2.0
df = df.groupby(['Sector_Plot']).agg(['min', 'max', 'mean']).reset_index()
df.columns = [''.join(col) for col in df.columns]  # ('X', 'mean') -> 'Xmean'

# find nearest sector_plot for each entry in the other dataframe
# vq returns (codes, distances); codes are row positions into the centroid array
res = vq(df1[['X', 'Y']].values, df[['Xmean', 'Ymean']].values)
df1['Estimated_Sector_Plot'] = df.iloc[res[0]].Sector_Plot.values
Result:
X Y Estimated_Sector_Plot
0 2.5 2 SE1_P1
1 2.0 1 SE1_P1
2 3.0 2 SE1_P1
3 5.0 3 SE1_P2
4 7.0 3 SE1_P2
5 6.0 3 SE1_P2
6 1.0 7 SE1_P3
7 4.0 6 SE1_P3
8 2.0 7 SE1_P3
9 28.0 25 SE2_P3
10 29.0 31 SE2_P2
11 18.0 19 SE2_P1
12 17.0 20 SE2_P1
13 19.0 22 SE2_P1
14 30.0 25 SE2_P3
15 25.0 25 SE2_P2
16 18.0 26 SE2_P4
17 17.0 31 SE2_P4
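Rows 10 and 14 in this result differ from the labels the question expected, and that follows directly from nearest-centroid distances. The sketch below checks those two points by hand; the centroids are the X/Y means of the sample rows for three of the clusters, and the names `centroids`, `labels`, and `points` are only illustrative.

```python
import numpy as np
from scipy.cluster.vq import vq

# Mean X/Y of the sample rows for three of the Sector_Plot clusters
labels = ["SE1_P1", "SE2_P2", "SE2_P3"]
centroids = np.array([[2.5, 2.0],     # SE1_P1: rows (3, 3) and (2, 1)
                      [28.0, 28.0],   # SE2_P2: rows (25, 25) and (31, 31)
                      [29.0, 26.5]])  # SE2_P3: rows (27, 28) and (31, 25)

points = np.array([[29.0, 31.0],   # row 10 of the result
                   [30.0, 25.0],   # row 14 of the result
                   [3.0, 2.0]])    # row 2 of the result

# vq returns the index of the closest centroid and the distance to it
codes, dists = vq(points, centroids)

# The same assignment computed by hand: Euclidean distance from every
# point to every centroid, then the position of the smallest distance.
all_dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
manual = all_dists.argmin(axis=1)

for p, c, d in zip(points, codes, dists):
    print(p, labels[c], round(float(d), 3))
```

Because (29, 31) is closer to the SE2_P2 centroid and (30, 25) is closer to the SE2_P3 centroid, those are the labels assigned. The returned distances could also be compared against a chosen threshold to flag points that are far from every centroid.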