How do I get data into the ndarray format for SKLearn?

Scikit-Learn is a great Python module that provides many machine-learning algorithms, including support vector machines. I've been learning how to use the module for the past few days, and I've noticed it relies heavily on the separate numpy module.

I understand what the module does, but I'm still learning how it works. Here is a very brief example of what I'm using sklearn for:

from sklearn import datasets, svm
import numpy

digits = datasets.load_digits() #8x8 pixel images of the digits 0-9, plus the target digit for each image

clf = svm.SVC(gamma=0.001, C=100) #SVC is the algorithm used for classifying this type of data

x, y = digits.data[:-1], digits.target[:-1] #feed it all the data except the last sample
clf.fit(x, y) #"train" the SVM

print(clf.predict(digits.data[:1])) #>>>[0]  (predict expects a 2D array, so pass a slice rather than a single row)
#accuracy is about 99% when trained on the full dataset of 1797 samples.
#if the number of samples gets smaller, accuracy decreases; with 10 samples (0-9),
#accuracy can still be as high as 90%.
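
The accuracy claim in the comments can be checked with a quick sketch using a held-out test set; this split-based methodology is an assumption (the original simply trains on all but one sample), and the exact score will vary slightly with the split:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(X_train, y_train)          #train on 75% of the samples
print(clf.score(X_test, y_test))   #mean accuracy on the held-out 25%, typically around 0.99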

That's very basic classification. There are 10 classes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Using the following code with matplotlib.pyplot:

import matplotlib.pyplot as plt #in shell after running previous code
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation="nearest")
plt.show()

gives the following image:

[image: grayscale plot of the first digit sample, a handwritten 0]

The first pixel (reading left to right, top to bottom) is represented by a 0, as is the second; the third is represented by 5, the fourth by about 13, and so on (the values range from 0 to 16). Here's the actual data for the image:

[[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]
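
This grid is digits.images[0]; the ndarray that the classifier actually consumes, digits.data[0], is just the same 64 values flattened into a single row. A minimal sketch to confirm the relationship:

import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

image = digits.images[0]   #8x8 ndarray, the grid printed above
flat = digits.data[0]      #the same values as a 1-D feature vector

print(image.shape)                           #(8, 8)
print(flat.shape)                            #(64,)
print(np.array_equal(image.ravel(), flat))   #True - data is just the flattened image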

So my question is this: if I wanted to classify text data, for example forum posts in the wrong subforum/category, how would I convert that data into the number system used in this dataset example?

For each sample (e.g. each forum post) you must have a feature vector (in Python, a list or 1-D array). For example, if you have 200 posts and their respective categories, you need 200 feature vectors as training data plus one list of 200 elements holding the category of each post. Each feature vector can be built from a model such as bag-of-words (see https://en.wikipedia.org/wiki/Bag-of-words_model ). Note that all training vectors must have the same dimension (for example, each vector might have 3000 elements, where each element represents the presence or absence of a particular word). This tutorial is good for beginners: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
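
As a concrete illustration, here is a minimal sketch using scikit-learn's CountVectorizer, which implements the bag-of-words model; the posts, labels, and query text below are made up for the example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

#hypothetical posts and their subforum labels (0 = hardware, 1 = software)
posts = ["my gpu keeps overheating",
         "how do I install this python package",
         "which power supply should I buy",
         "error when importing the module"]
labels = [0, 1, 0, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(posts).toarray()  #shape: (4 posts, vocabulary size)
#each post is now a fixed-length vector of word counts,
#just as each digit image above is a fixed-length vector of pixel values

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(X, labels)

new_post = vectorizer.transform(["error importing a python module"]).toarray()
print(clf.predict(new_post))  #predicted category of the new post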
