
Python Supervised Machine Learning

I am trying to understand how to use scikit-learn for supervised machine learning, so I've made up some data belonging to two different sets: set A and set B. I have 18 elements in set A and 18 elements in set B. Each element has three variables. See below:

#SetA
Variable1A = [ 3,4,4,5,4,5,5,6,7,7,5,4,5,6,4,9,3,4]
Variable2A = [ 5,4,4,3,4,5,4,5,4,3,4,5,3,4,3,4,4,3]
Variable3A = [ 7,8,4,5,6,7,3,3,3,4,4,9,7,6,8,6,7,8]


#SetB
Variable1B = [ 7,8,11,12,7,9,8,7,8,11,15,9,7,6,9,9,7,11]
Variable2B = [ 1,2,3,3,4,2,4,1,0,1,2,1,3,4,3,1,2,3]
Variable3B = [ 12,18,14,15,16,17,13,13,13,14,14,19,17,16,18,16,17,18]

How would I use scikit-learn for supervised machine learning so that, when I introduce new setA and setB data, it can try to identify which set each new data point belongs to?

Apologies that the data sets are small and made up. I just want to apply the same method using scikit-learn to other data sets.

Your question is quite broad, so this is just a brief outline. Instead of formatting your data that way, you want to put the two sets together in one list/array, with another column to represent which set each row belongs to. Something like this:

data = [
    [3, 5, 7, 0],
    [4, 4, 8, 0],  # these rows have 0 as the last element to represent group A
    ...
    [7, 1, 12, 1],
    [8, 2, 18, 1], # these have 1 as the last element to represent group B
    ...
]

An alternative is to put only the first three columns in data and call it X, and then have a separate array y containing just [0, 0, 0, ..., 1, 1, 1, ...] (indicating the group membership of each row). What you want to avoid is storing the information about which group a point is in in the names of the variables; you instead want the "set A or set B" information stored in the values of variables (as here it's stored in the values in the last column of data, or in y).
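A minimal sketch of the X / y layout described above, assuming numpy is installed. The names X, y and data follow the text; the lists are copied from the question, and the choice of 0 for set A and 1 for set B matches the outline above.

```python
import numpy as np

# Lists copied from the question
Variable1A = [3, 4, 4, 5, 4, 5, 5, 6, 7, 7, 5, 4, 5, 6, 4, 9, 3, 4]
Variable2A = [5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 5, 3, 4, 3, 4, 4, 3]
Variable3A = [7, 8, 4, 5, 6, 7, 3, 3, 3, 4, 4, 9, 7, 6, 8, 6, 7, 8]
Variable1B = [7, 8, 11, 12, 7, 9, 8, 7, 8, 11, 15, 9, 7, 6, 9, 9, 7, 11]
Variable2B = [1, 2, 3, 3, 4, 2, 4, 1, 0, 1, 2, 1, 3, 4, 3, 1, 2, 3]
Variable3B = [12, 18, 14, 15, 16, 17, 13, 13, 13, 14, 14, 19, 17, 16, 18, 16, 17, 18]

# One row per element, one column per variable: shape (36, 3)
X = np.vstack([
    np.column_stack((Variable1A, Variable2A, Variable3A)),
    np.column_stack((Variable1B, Variable2B, Variable3B)),
])

# 0 marks set A, 1 marks set B, in the same row order as X
y = np.array([0] * len(Variable1A) + [1] * len(Variable1B))

# Appending y as a last column gives the single-array layout instead
data = np.column_stack((X, y))

print(X.shape, y.shape, data.shape)
```

Either layout carries the same information; most scikit-learn estimators take X and y as separate arguments, so the split form is usually the more convenient one.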

Whatever you do, you'll almost certainly want to use numpy arrays or pandas data structures to hold your data, rather than lists.
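As one possible pandas version, the sketch below holds everything in a single DataFrame. The column names var1, var2, var3 and set are made up for illustration; only the numbers come from the question.

```python
import pandas as pd

# Set A and set B from the question, one column per variable
frame_a = pd.DataFrame({
    "var1": [3, 4, 4, 5, 4, 5, 5, 6, 7, 7, 5, 4, 5, 6, 4, 9, 3, 4],
    "var2": [5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 5, 3, 4, 3, 4, 4, 3],
    "var3": [7, 8, 4, 5, 6, 7, 3, 3, 3, 4, 4, 9, 7, 6, 8, 6, 7, 8],
})
frame_b = pd.DataFrame({
    "var1": [7, 8, 11, 12, 7, 9, 8, 7, 8, 11, 15, 9, 7, 6, 9, 9, 7, 11],
    "var2": [1, 2, 3, 3, 4, 2, 4, 1, 0, 1, 2, 1, 3, 4, 3, 1, 2, 3],
    "var3": [12, 18, 14, 15, 16, 17, 13, 13, 13, 14, 14, 19, 17, 16, 18, 16, 17, 18],
})

# Group membership lives in a column value, not in a variable name
frame_a["set"] = "A"
frame_b["set"] = "B"

df = pd.concat([frame_a, frame_b], ignore_index=True)

X = df[["var1", "var2", "var3"]]  # features
y = df["set"]                     # labels

# Per-set averages give a quick look at how the two groups differ
print(df.groupby("set").mean())
```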

There are numerous tutorials and examples available for how to use scikit-learn, as well as sample data sets that may be more useful than the one you made up. "Supervised machine learning" is a broad term incorporating many different approaches to the task of deciding which group a data point is in, so you'll have to play around and try out different classification algorithms. All of this info can be found by googling and/or browsing through the scikit-learn documentation.
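One way to "play around" is to score a few classifiers with cross-validation on the same data. The sketch below is only an illustration: the three classifiers and their settings (max_iter=1000, n_neighbors=3, cv=5) are arbitrary choices, not recommendations from the answer, and a 36-row made-up data set says little about how they would compare on real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Data from the question: rows are elements, columns are the three variables
set_a = np.column_stack((
    [3, 4, 4, 5, 4, 5, 5, 6, 7, 7, 5, 4, 5, 6, 4, 9, 3, 4],
    [5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 5, 3, 4, 3, 4, 4, 3],
    [7, 8, 4, 5, 6, 7, 3, 3, 3, 4, 4, 9, 7, 6, 8, 6, 7, 8],
))
set_b = np.column_stack((
    [7, 8, 11, 12, 7, 9, 8, 7, 8, 11, 15, 9, 7, 6, 9, 9, 7, 11],
    [1, 2, 3, 3, 4, 2, 4, 1, 0, 1, 2, 1, 3, 4, 3, 1, 2, 3],
    [12, 18, 14, 15, 16, 17, 13, 13, 13, 14, 14, 19, 17, 16, 18, 16, 17, 18],
))
X = np.vstack((set_a, set_b))
y = np.array([0] * 18 + [1] * 18)  # 0 = set A, 1 = set B

# Candidate classifiers (an arbitrary selection for illustration)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# 5-fold cross-validation: fit on 4/5 of the rows, score on the rest, repeat
mean_accuracy = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(name, round(mean_accuracy[name], 3))
```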

I think this is a good question, and no worries if you get the feeling it was not clear enough. Supervised learning can be used to classify an instance (data row) into one of several categories (or in your case just 2 sets). What you are missing in the above example is a variable that says which set each row belongs to.

import numpy as np # numpy will help us to concatenate the columns into a 2-dimensional array
# so instead of having 3 separate arrays, we have 1 array with 3 columns and 18 rows

Variable1A = [ 3,4,4,5,4,5,5,6,7,7,5,4,5,6,4,9,3,4]
Variable2A = [ 5,4,4,3,4,5,4,5,4,3,4,5,3,4,3,4,4,3]
Variable3A = [ 7,8,4,5,6,7,3,3,3,4,4,9,7,6,8,6,7,8]

#our target variable for A (one label per row, 18 in total)

target_variable_A=[1]*18

Variable1B = [ 7,8,11,12,7,9,8,7,8,11,15,9,7,6,9,9,7,11]
Variable2B = [ 1,2,3,3,4,2,4,1,0,1,2,1,3,4,3,1,2,3]
Variable3B = [ 12,18,14,15,16,17,13,13,13,14,14,19,17,16,18,16,17,18]

# target variable for B (one label per row, 18 in total)
target_variable_B=[0]*18

# let's create a dataset C with only 4 rows; we need to predict whether each row belongs to "1" (dataset A) or "0" (dataset B)

Variable1C = [ 7,4,4,12]
Variable2C = [ 1,4,4,3]
Variable3C = [ 12,8,4,15]

#make the objects 2-dimensional arrays (so 1 array with X rows and 3 columns, one per variable)
Dataset_A=np.column_stack((Variable1A,Variable2A,Variable3A))
Dataset_B=np.column_stack((Variable1B,Variable2B,Variable3B))
Dataset_C=np.column_stack((Variable1C,Variable2C,Variable3C))

print(" dataset A rows ", Dataset_A.shape[0]," dataset A columns ", Dataset_A.shape[1] )
print(" dataset B rows ", Dataset_B.shape[0]," dataset B columns ", Dataset_B.shape[1] )
print(" dataset C rows ", Dataset_C.shape[0]," dataset C columns ", Dataset_C.shape[1] )

########## Prints (Python 2 formatting; Python 3 prints the same values without the tuple formatting) ##########
#(' dataset A rows ', 18L, ' dataset A columns ', 3L)
#(' dataset B rows ', 18L, ' dataset B columns ', 3L)
#(' dataset C rows ', 4L, ' dataset C columns ', 3L)

# since now we have an identification that tells us if it belongs to A or B (e.g. 1 or 0) we can append the new sets together
Dataset_AB=np.concatenate((Dataset_A,Dataset_B),axis=0) # this creates a set with 36 rows and 3 columns
target_variable_AB=np.concatenate((target_variable_A,target_variable_B),axis=0)

print(" dataset AB rows ", Dataset_AB.shape[0]," dataset Ab columns ", Dataset_AB.shape[1] )
print(" target Variable rows ", target_variable_AB.shape[0])

########## Prints (Python 2 formatting; Python 3 prints the same values without the tuple formatting) ##########
#(' dataset AB rows ', 36L, ' dataset Ab columns ', 3L)
#(' target Variable rows ', 36L)

#now we will select a commonly used supervised scikit-learn classifier - Logistic Regression
from sklearn.linear_model import LogisticRegression
model=LogisticRegression() # we create an instance of the model

model.fit(Dataset_AB,target_variable_AB) # the model learns to distinguish between A and B (1 or 0)

#now we make predictions for the new dataset C

predictions_for_C=model.predict(Dataset_C)
print(predictions_for_C)
# this will print
#[0 1 1 0]
# so the first case belongs to set B (0), the second and third to set A (1), and the fourth to set B (0)
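Beyond the hard 0/1 answer, it can be useful to see how confident the fitted model is. The sketch below repeats the setup above in compact form and calls predict_proba, which LogisticRegression provides; the variable names are chosen here for illustration, and max_iter=1000 is an arbitrary setting used only to avoid convergence warnings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same data as above: 18 rows for A, 18 rows for B, 3 columns each
Dataset_A = np.column_stack((
    [3, 4, 4, 5, 4, 5, 5, 6, 7, 7, 5, 4, 5, 6, 4, 9, 3, 4],
    [5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 5, 3, 4, 3, 4, 4, 3],
    [7, 8, 4, 5, 6, 7, 3, 3, 3, 4, 4, 9, 7, 6, 8, 6, 7, 8],
))
Dataset_B = np.column_stack((
    [7, 8, 11, 12, 7, 9, 8, 7, 8, 11, 15, 9, 7, 6, 9, 9, 7, 11],
    [1, 2, 3, 3, 4, 2, 4, 1, 0, 1, 2, 1, 3, 4, 3, 1, 2, 3],
    [12, 18, 14, 15, 16, 17, 13, 13, 13, 14, 14, 19, 17, 16, 18, 16, 17, 18],
))
Dataset_C = np.column_stack(([7, 4, 4, 12], [1, 4, 4, 3], [12, 8, 4, 15]))

Dataset_AB = np.concatenate((Dataset_A, Dataset_B), axis=0)
target_variable_AB = np.array([1] * 18 + [0] * 18)  # 1 = set A, 0 = set B

model = LogisticRegression(max_iter=1000)
model.fit(Dataset_AB, target_variable_AB)

predictions = model.predict(Dataset_C)

# One row per case in C, one column per class; column order follows model.classes_
probabilities = model.predict_proba(Dataset_C)

print("class order:", model.classes_)
for row, label, proba in zip(Dataset_C, predictions, probabilities):
    print(row, "->", label, np.round(proba, 3))
```

Probabilities close to 0.5 would suggest a case the model cannot separate well, which a bare 0 or 1 hides.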

Supervised learning means that the data you provide for training the model is labelled; that is, the outcome of each sample used for training is known beforehand.

In the problem you have described there are two sets, set A and set B, so you need a binary classifier such as a Logistic Regression model.

First, label the elements of set A and set B as 1 or 0 (or vice versa) based on which set they belong to; that is, if element e belongs to set A mark it as 1, otherwise mark it as 0.

Then import the Logistic Regression classifier from scikit-learn in Python.

Next, merge the two sets, set A followed by set B or vice versa, and merge the labels you have already assigned in the same order.

You can use either pandas or numpy for stacking these sets and preparing the labelled dataset.

Now you have a well-labelled dataset.

You can now call the fit function of the Logistic Regression classifier with the dataset (containing the set A and set B elements) and the label set.

After that, call the predict function with the data you want to test; you will get the predicted class, which is either 0 or 1.

If you want the set names, you can use a dictionary that maps the keys 1 and 0 to the values 'set A' and 'set B', so that you can look up the set from each prediction.
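The steps above can be sketched end to end as follows, treating each element as one row with three values and labelling set A as 1 and set B as 0. The two rows in new_rows are hypothetical test points chosen for illustration, and max_iter=1000 is an arbitrary setting, not part of the answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Set A and set B from the question, one row per element
set_a = np.column_stack((
    [3, 4, 4, 5, 4, 5, 5, 6, 7, 7, 5, 4, 5, 6, 4, 9, 3, 4],
    [5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 5, 3, 4, 3, 4, 4, 3],
    [7, 8, 4, 5, 6, 7, 3, 3, 3, 4, 4, 9, 7, 6, 8, 6, 7, 8],
))
set_b = np.column_stack((
    [7, 8, 11, 12, 7, 9, 8, 7, 8, 11, 15, 9, 7, 6, 9, 9, 7, 11],
    [1, 2, 3, 3, 4, 2, 4, 1, 0, 1, 2, 1, 3, 4, 3, 1, 2, 3],
    [12, 18, 14, 15, 16, 17, 13, 13, 13, 14, 14, 19, 17, 16, 18, 16, 17, 18],
))

# Merge the sets and the labels in the same order: A first, then B
dataset = np.vstack((set_a, set_b))
labels = np.array([1] * len(set_a) + [0] * len(set_b))  # 1 = set A, 0 = set B

model = LogisticRegression(max_iter=1000)
model.fit(dataset, labels)

# Hypothetical new elements to classify (three variables each)
new_rows = [[4, 4, 8], [12, 3, 15]]
predicted = model.predict(new_rows)

# Map each numeric class back to a readable set name
set_names = {1: "set A", 0: "set B"}
names = [set_names[int(label)] for label in predicted]
print(names)
```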

import pandas as pd
import numpy as np 
from sklearn.linear_model import LogisticRegression as lr

#set A

firstA=[3,4,4,5,4,5,5,6,7,7,5,4,5,6,4,9,3,4]
secondA=[5,4,4,3,4,5,4,5,4,3,4,5,3,4,3,4,4,3]
thirdA=[7,8,4,5,6,7,3,3,3,4,4,9,7,6,8,6,7,8]

#set B

firstB=[7,8,11,12,7,9,8,7,8,11,15,9,7,6,9,9,7,11]
secondB=[1,2,3,3,4,2,4,1,0,1,2,1,3,4,3,1,2,3]
thirdB=[12,18,14,15,16,17,13,13,13,14,14,19,17,16,18,16,17,18]

#stacking up and building the dataset
# Note: in this layout each of the six variable lists becomes one row with 18 columns,
# so the model is trained on 6 samples and predict() expects a row of 18 values.

Aset=[firstA,secondA,thirdA]
Bset=[firstB,secondB,thirdB]

data=pd.DataFrame(Aset+Bset)   # 6 rows x 18 columns
label=np.array([0,0,0,1,1,1])  # one label per row: 0 for set A, 1 for set B


#Training and testing the model

model=lr()
model.fit(data,label)
k=model.predict([[17,18,14,15,16,17,13,
13,13,41,14,19,17,16,18,16,17,28]])

#mapping (set A rows were given label 0 and set B rows label 1)

dic={0:"set A",1:"set B"}
print(dic[int(k[0])])  # k is an array with one prediction, so take its first entry
