Cluster X, Y values into Sector and Plot in pandas, pandas groupby and/or scikit

I have a data frame as shown below:

X    Y     Sector     Plot
5    3     SE1        P2
3    3     SE1        P1
6    7     SE1        P3
1    6     SE1        P3
2    1     SE1        P1
7    3     SE1        P2
17   20    SE2        P1
23   22    SE2        P1
27   28    SE2        P3
31   25    SE2        P3
25   25    SE2        P2
31   31    SE2        P2
17   25    SE2        P4
23   31    SE2        P4

From the above data, I would like to compute the min and max values of X and Y for each Sector, Plot combination.

The expected output data frame is shown below.

Sector_Plot  Xmin  Xmax  Ymin  Ymax
SE1_P1       2     3     1     3
SE1_P2       5     7     3     3
SE1_P3       1     6     6     7
SE2_P1       17    23    20    22
SE2_P2       25    31    25    25
SE2_P3       27    31    25    31
SE2_P4       17    23    25    31
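
The per-group bounds can be computed with a groupby. The following is a minimal sketch, assuming the sample rows above are loaded into a DataFrame named `df`; the name `bounds` and the use of named aggregation are choices made here for illustration. Computed strictly from the sample rows, Ymax comes out as 31 for SE2_P2 and 28 for SE2_P3, which differs from the table as posted.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    'X': [5, 3, 6, 1, 2, 7, 17, 23, 27, 31, 25, 31, 17, 23],
    'Y': [3, 3, 7, 6, 1, 3, 20, 22, 28, 25, 25, 31, 25, 31],
    'Sector': ['SE1'] * 6 + ['SE2'] * 8,
    'Plot': ['P2', 'P1', 'P3', 'P3', 'P1', 'P2',
             'P1', 'P1', 'P3', 'P3', 'P2', 'P2', 'P4', 'P4'],
})

# Combine the two labels into one key, e.g. "SE1_P1"
df['Sector_Plot'] = df['Sector'] + '_' + df['Plot']

# Named aggregation yields flat column names directly
bounds = df.groupby('Sector_Plot').agg(
    Xmin=('X', 'min'), Xmax=('X', 'max'),
    Ymin=('Y', 'min'), Ymax=('Y', 'max'),
).reset_index()

print(bounds)
```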

From the above rule, if we get new X, Y values, we should be able to predict Sector_Plot as shown below.

X    Y    Estimated_Sector_Plot
2.5  2    SE1_P1
2    1    SE1_P1
3    2    SE1_P1
5    3    SE1_P2
7    3    SE1_P2
6    3    SE1_P2
1    7    SE1_P3
4    6    SE1_P3
2    7    SE1_P3
28   25   SE2_P3
29   31   SE2_P3
18   19   SE2_P1
17   20   SE2_P1
19   22   SE2_P1
30   25   SE2_P2
25   25   SE2_P2
18   26   SE2_P4
17   31   SE2_P4
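
A literal reading of this rule is a bounding-box lookup: a point belongs to a Sector_Plot if it falls inside that group's min/max rectangle. The sketch below hard-codes the bounds from the expected-output table; the function name `matching_boxes` is hypothetical. Because rectangles can overlap or miss a point entirely, it returns every matching label rather than a single one.

```python
import pandas as pd

# Bounds copied from the expected-output table in the question
bounds = pd.DataFrame({
    'Sector_Plot': ['SE1_P1', 'SE1_P2', 'SE1_P3',
                    'SE2_P1', 'SE2_P2', 'SE2_P3', 'SE2_P4'],
    'Xmin': [2, 5, 1, 17, 25, 27, 17],
    'Xmax': [3, 7, 6, 23, 31, 31, 23],
    'Ymin': [1, 3, 6, 20, 25, 25, 25],
    'Ymax': [3, 3, 7, 22, 25, 31, 31],
})

def matching_boxes(x, y, bounds):
    """Return the labels of every box that contains the point (x, y)."""
    inside = (
        bounds['Xmin'].le(x) & bounds['Xmax'].ge(x)
        & bounds['Ymin'].le(y) & bounds['Ymax'].ge(y)
    )
    return bounds.loc[inside, 'Sector_Plot'].tolist()

print(matching_boxes(2.5, 2, bounds))   # ['SE1_P1']
print(matching_boxes(28, 25, bounds))   # ['SE2_P2', 'SE2_P3'] (overlapping boxes)
print(matching_boxes(18, 19, bounds))   # [] (outside every box)
```

The second and third calls show why a pure box rule is not enough on its own: some points match two boxes and some match none, so a tie-breaking or fallback rule is still needed.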

I tried a machine learning method, but it was a flop. Can this be done by any other method?

I am sharing my code below:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# train and test are assumed to be DataFrames with columns X, Y, SECTOR and PLOT (loading not shown)

def find_frequent_labels(df, var, rare_perc):
    # labels of `var` whose share of rows exceeds rare_perc
    df = df.copy()
    tmp = df.groupby(var)['X'].count() / len(df)
    return tmp[tmp > rare_perc].index

# replace infrequent sectors with the label 'Rare'
for var in ['SECTOR']:
    frequent_ls = find_frequent_labels(train, var, 0.01)
    train[var] = np.where(train[var].isin(frequent_ls), train[var], 'Rare')
    test[var] = np.where(test[var].isin(frequent_ls), test[var], 'Rare')

def replace_with_X(train1, test1, var, target):
    # ordinal-encode `var` by the mean of `target` within each label
    ordered_labels = train1.groupby([var])[target].mean().sort_values().index
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}
    train1['Sec_X'] = train1[var].map(ordinal_label)
    test1['Sec_X'] = test1[var].map(ordinal_label)

for var in ['SECTOR']:
    replace_with_X(train, test, var, 'X')

def replace_with_Y(train1, test1, var, target):
    # same encoding as above, stored in 'Sec_Y'
    ordered_labels = train1.groupby([var])[target].mean().sort_values().index
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}
    train1['Sec_Y'] = train1[var].map(ordinal_label)
    test1['Sec_Y'] = test1[var].map(ordinal_label)

for var in ['SECTOR']:
    replace_with_Y(train, test, var, 'Y')

# integer-encode the target and keep lookups in both directions
train['Plot_id'] = train['PLOT'].factorize()[0]
category_id_df = train[['PLOT', 'Plot_id']].drop_duplicates().sort_values('Plot_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['Plot_id', 'PLOT']].values)

model = LinearSVC(C=1.0, class_weight='balanced')

# 'Sector_code' is not created above (the encoded columns are 'Sec_X' and 'Sec_Y')
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    train[['X', 'Y', 'Sector_code']], train['Plot_id'], train.index,
    test_size=0.01, random_state=0)
model.fit(X_train, y_train)
test['Plot_id'] = model.predict(test[['X', 'Y', 'Sector_code']])

Please note that I am very new to machine learning and pandas.

This type of task can be solved with vector quantization. Instead of the min and max, we need the centroids (mean x/y coordinates) of each Sector_Plot cluster. Then we get the nearest cluster with scipy.cluster.vq.vq:

import pandas as pd
from scipy.cluster.vq import vq

df = pd.DataFrame({'X': [ 5,  3,  6,  1,  2,  7, 17, 23, 27, 31, 25, 31, 17, 23],
                   'Y': [ 3,  3,  7,  6,  1,  3, 20, 22, 28, 25, 25, 31, 25, 31],
                   'Sector': ['SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE1', 'SE2', 'SE2', 'SE2', 'SE2', 'SE2', 'SE2', 'SE2', 'SE2'],
                   'Plot': ['P2', 'P1', 'P3', 'P3', 'P1', 'P2', 'P1', 'P1', 'P3', 'P3', 'P2', 'P2', 'P4', 'P4']})

df1 = pd.DataFrame({'X': [ 2.5,  2 ,  3 ,  5 ,  7 ,  6 ,  1 ,  4 ,  2 , 28 , 29 , 18 , 17 , 19 , 30 , 25 , 18 , 17 ],
                   'Y': [ 2,  1,  2,  3,  3,  3,  7,  6,  7, 25, 31, 19, 20, 22, 25, 25, 26, 31]})

# prepare given dataframe, get centroids (means)
df['Sector_Plot'] = df.Sector + '_' + df.Plot
df = df.drop(columns=['Sector', 'Plot'])
df = df.groupby(['Sector_Plot']).agg(['min', 'max', 'mean']).reset_index()
# flatten the MultiIndex columns: ('X', 'min') -> 'Xmin', etc.
df.columns = [''.join(col) for col in df.columns]

# find nearest sector_plot for each entry in the other dataframe
# vq returns (codes, distances); codes index into the centroid rows
res = vq(df1[['X', 'Y']].values, df[['Xmean', 'Ymean']].values)
df1['Estimated_Sector_Plot'] = df.iloc[res[0]].Sector_Plot.values

Result:

       X   Y Estimated_Sector_Plot
0    2.5   2                SE1_P1
1    2.0   1                SE1_P1
2    3.0   2                SE1_P1
3    5.0   3                SE1_P2
4    7.0   3                SE1_P2
5    6.0   3                SE1_P2
6    1.0   7                SE1_P3
7    4.0   6                SE1_P3
8    2.0   7                SE1_P3
9   28.0  25                SE2_P3
10  29.0  31                SE2_P2
11  18.0  19                SE2_P1
12  17.0  20                SE2_P1
13  19.0  22                SE2_P1
14  30.0  25                SE2_P3
15  25.0  25                SE2_P2
16  18.0  26                SE2_P4
17  17.0  31                SE2_P4
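
To make the assignment step concrete, here is a rough NumPy sketch of what a nearest-centroid lookup does: compute the Euclidean distance from each point to every centroid and take the closest. The centroids are the per-group means of the sample data; the arrays `labels`, `centroids` and `points` are names introduced here for illustration. It also shows why rows 10 and 14 in the result are assigned by distance to the group means rather than by the min/max boxes.

```python
import numpy as np

# Centroids (mean X, mean Y) of each Sector_Plot group in the sample data
labels = np.array(['SE1_P1', 'SE1_P2', 'SE1_P3',
                   'SE2_P1', 'SE2_P2', 'SE2_P3', 'SE2_P4'])
centroids = np.array([[2.5, 2.0], [6.0, 3.0], [3.5, 6.5],
                      [20.0, 21.0], [28.0, 28.0], [29.0, 26.5], [20.0, 28.0]])

# A few of the new points from the question
points = np.array([[2.5, 2.0], [29.0, 31.0], [30.0, 25.0], [18.0, 26.0]])

# Pairwise Euclidean distances, shape (n_points, n_centroids)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Index of the closest centroid for each point, mapped back to its label
nearest = labels[dists.argmin(axis=1)]
print(nearest)  # ['SE1_P1' 'SE2_P2' 'SE2_P3' 'SE2_P4']
```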
