
Using prepared data for Sci-kit classification

I am trying to use the scikit-learn Python library to classify a bunch of URLs based on whether they contain certain keywords matching a user profile. A user has a name, an email address, ... and a URL assigned to them. I have created a text file with the result of each profile-data match on each link, so it is in the format:

Name  Email  Address
  0     1      0      => Relevant
  1     1      0      => Relevant
  0     1      1      => Relevant
  0     0      0      => Not Relevant

Here 0 or 1 signifies whether the attribute was found on the page (each row is a web page). How do I give this data to scikit-learn so it can use it to run a classifier? The examples I have seen all take data from a predefined scikit-learn dataset such as digits or iris, or generate it in the format I already have. I just don't know how to provide the data format I have to the library.

The above is a toy example, and I have many more than 3 features.

The data needed is a numpy array (in this case a "matrix") with the shape (n_samples, n_features).

A simple way to read the csv file into the right format is by using numpy.genfromtxt. Also refer to this thread.

Let the contents of a csv file (say file.csv in the current working directory) be:

a,b,c,target
1,1,1,0
1,0,1,0
1,1,0,1
0,0,1,1
0,1,1,0

To load it, we do:

import numpy as np
data = np.genfromtxt('file.csv', delimiter=',', skip_header=1)

delimiter=',' tells genfromtxt that the fields are comma-separated, and skip_header=1 skips the header line (the a,b,c,target line). Refer to numpy's documentation for more details.

Once you load the data, you need to do some pre-processing based on your input data format. The preprocessing could be something like splitting the input from the targets (for classification) or splitting the whole dataset into training and validation sets (for cross-validation); a sketch of the latter follows the feature/target split below.

To split the input (feature matrix) from the output (target vector), we do:

features = data[:, :3]
targets = data[:, 3]   # The last column is identified as the target

For the CSV data given above, the arrays will look like:

features = array([[ 1.,  1.,  1.],
                  [ 1.,  0.,  1.],
                  [ 1.,  1.,  0.],
                  [ 0.,  0.,  1.],
                  [ 0.,  1.,  1.]])  # shape = (5, 3)

targets = array([ 0.,  0.,  1.,  1.,  0.])  # shape = (5,)
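
If you also want to hold out part of the data as a validation set for cross-validation, as mentioned above, a minimal sketch using scikit-learn's train_test_split could look like the following (the test_size of 0.25 and random_state of 0 are just illustrative choices):

from sklearn.model_selection import train_test_split

# Hold out 25% of the rows for validation; random_state makes the split reproducible
X_train, X_val, y_train, y_val = train_test_split(
    features, targets, test_size=0.25, random_state=0)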

Now these arrays are passed to the estimator object's fit method. If you are using the popular SVM classifier, then:

>>> from sklearn.svm import LinearSVC
>>> linear_svc_model = LinearSVC()
>>> linear_svc_model.fit(X=features, y=targets) 
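
Once fitted, the model can score new pages. A minimal sketch (the feature row below is made up for illustration; it must have the same 3 columns as the training data):

>>> new_page = [[1, 0, 1]]              # one new page, same 3 features (a, b, c)
>>> linear_svc_model.predict(new_page)  # returns an array with the predicted class (0 or 1)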
