Example of using features with libsvm for a support vector machine in Python

Question

I have scraped a lot of eBay titles like this one:

Apple iPhone 5 White 16GB Dual-Core

I have manually tagged all of them in this way:

B M C S NA

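where B = Brand (Apple), M = Model (iPhone 5), C = Color (White), S = Size (16GB), NA = Not Assigned (Dual-Core).

Now I need to train an SVM classifier using the libsvm library in Python to learn the sequence patterns that occur in eBay titles.

I need to extract new values for these attributes (Brand, Model, Color, Size) by treating the problem as a classification task. This way I can predict new models.

I want to consider these features: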

* Position
- from the beginning of the title
- to the end of the listing
* Orthographic features
- current word contains a digit
- current word is capitalized 
....

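I can't understand how to give all this information to the library. The official documentation lacks a lot of information.

My classes are Brand, Model, Size, Color, NA.

What must the input file for the SVM algorithm contain?

How can I create it? Could I have an example of that file, considering the 4 features listed above? Could I also have an example of how to build the input file?

*UPDATE* I want to represent these features... how should I do it?

Identity of the current word

I think I can interpret it this way: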

0 --> Brand
1 --> Model
2 --> Color
3 --> Size 
4 --> NA

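If I know that the word is a Brand, I set that variable to 1 (true). That works for the training set (because I have tagged all the words), but how can I do it for the test set? I don't know the category of a word there (that is why I am learning it :D).

N-gram substring features of the current word (N = 4, 5, 6): I have no idea what this means.
Identity of the 2 words before the current word. How can I model this feature?

Considering the legend I created for the first feature, I have 5^2 = 25 combinations: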

00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44

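How can I convert this into a format that libsvm (or scikit-learn) can understand?

Membership in the 4 attribute dictionaries

How can I do this? Having 4 dictionaries (Color, Size, Model, and Brand), I think I must create a bool variable that I set to true if and only if the current word matches an entry in one of the 4 dictionaries.

Exclusive membership in the Brand dictionary

I think that, as with the dictionary-membership feature above, I must use a bool variable. Do you agree?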

Answer 1

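Here is a step-by-step guide to training an SVM on your data and then evaluating it on the same dataset. It is also available at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f. At that URL you can also see the output of the intermediate data and the resulting accuracy (it is an iPython notebook).

Step 0: Install dependencies

You need to install the following libraries:

pandas
scikit-learn

From the command line: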

pip install pandas
pip install scikit-learn

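Step 1: Load the data

We will use pandas to load our data. pandas is a library for easily loading data. For illustration, we first save sample data to a CSV file and then load it.

We will train the SVM with train.csv and take the test labels from test.csv.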

import pandas as pd

train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""


with open('train.csv', 'w') as output:
    output.write(train_data_contents)

train_dataframe = pd.read_csv('train.csv')
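The feature columns in train.csv have to be computed from the tagged titles first. Below is a minimal sketch of that step, under a few assumptions: tokens and tags are already aligned (so "iPhone 5" is one token), positions are counted from 0, and "capitalized" means the first character is uppercase. The helper name `title_to_rows` is made up for illustration. The tag "N" stands for NA here, because pandas reads the literal string "NA" as a missing value by default.

```python
import csv
import io

def title_to_rows(tokens, tags):
    """Build one feature row per (token, tag) pair of a single title."""
    rows = []
    last = len(tokens) - 1
    for position, (token, tag) in enumerate(zip(tokens, tags)):
        rows.append({
            "class_label": tag,
            "distance_from_beginning": position,
            "distance_from_end": last - position,
            "contains_digit": int(any(ch.isdigit() for ch in token)),
            "capitalized": int(token[:1].isupper()),
        })
    return rows

# One hand-tagged title from the question, already split into aligned tokens.
tokens = ["Apple", "iPhone 5", "White", "16GB", "Dual-Core"]
tags = ["B", "M", "C", "S", "N"]
rows = title_to_rows(tokens, tags)

# Serialize the rows with the same header that train.csv uses.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Running the same function over every tagged title and concatenating the rows would give a train.csv in the layout used in this answer.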

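Step 2: Process the data

We convert the dataframe into numpy arrays, which is the format scikit-learn understands.

We need to convert the labels "B", "M", "C", ... to numbers because the SVM does not understand strings.

Then we will train an SVM on the data. (Note that svm.SVC() as used in step 3 defaults to an RBF kernel; pass kernel='linear' or use LinearSVC if you want a linear SVM.)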

import numpy as np

train_labels = train_dataframe.class_label
# Sort the label set so the label-to-index mapping is deterministic.
labels = sorted(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])
train_features = train_dataframe.iloc[:, 1:]
train_features = np.array(train_features)

print("train labels:")
print(train_labels)
print()
print("train features:")
print(train_features)

We see here that the length of train_labels (5) exactly matches the number of rows in train_features. Each item in train_labels corresponds to one row.

Step 3: Train the SVM

from sklearn import svm
classifier = svm.SVC()
classifier.fit(train_features, train_labels)

Step 4: Evaluate the SVM on some test data

test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""

with open('test.csv', 'w') as output:
    output.write(test_data_contents)

test_dataframe = pd.read_csv('test.csv')

test_labels = test_dataframe.class_label
# Reuse the `labels` list built in step 2 so that test labels are encoded
# with the same label-to-index mapping the classifier was trained on.
test_labels = np.array([labels.index(x) for x in test_labels])

test_features = test_dataframe.iloc[:, 1:]
test_features = np.array(test_features)

results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
accuracy = num_correct / len(test_labels)
print("model accuracy (%):", accuracy * 100)

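Links & tips

Example code for how to load LinearSVC: http://scikit-learn.org/stable/modules/svm.html#svm
A long list of scikit-learn examples: http://scikit-learn.org/stable/auto_examples/index.html. I have found these mildly helpful, but often confusing.
If you find that the SVM takes a long time to train, try LinearSVC instead: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Here is another tutorial on getting familiar with machine learning models: http://scikit-learn.org/stable/tutorial/basic/tutorial.html

You should be able to take this code, replace train.csv with your training data and test.csv with your test data, and get predictions for your test data as well as accuracy results.

Note that since you are evaluating on the same data you trained on, the accuracy will be unusually high.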
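To get a more honest accuracy figure, keep some labeled rows out of training and score only on those. Here is a minimal sketch in plain Python, assuming each row is a (label, features) pair. `split_rows` and `accuracy` are hypothetical helper names, and scikit-learn ships its own splitting utilities that do the same job.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, expected):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(expected)

# The five sample rows from this answer, as (label, features) pairs.
rows = [("B", [1, 10, 1, 0]), ("M", [10, 1, 0, 1]), ("C", [2, 3, 0, 1]),
        ("S", [23, 2, 0, 0]), ("N", [12, 0, 0, 1])]
train_rows, test_rows = split_rows(rows, test_fraction=0.4)
print(len(train_rows), len(test_rows))

# Scoring compares predictions for the held-out rows with their true labels.
print(accuracy(["B", "M", "C"], ["B", "M", "S"]))
```

The classifier would then be fitted on train_rows only, and its predictions for test_rows passed to `accuracy`. With only five rows the split is purely illustrative; a real dataset needs enough held-out examples of every class.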

Answer 2

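I echo @MarcoPashkov's comments, but will try to elaborate on the LibSVM file format. I find the documentation comprehensive yet hard to find; for the Python library I recommend the README on GitHub.

An important point to recognize is that there is a sparse format, in which all features with value 0 are removed, and a dense format, in which features with value 0 are kept. Here are equivalent examples of each, taken from the README.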

# Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]

The y variable stores a list of the categories (labels), one per sample.

The x variable stores the feature vectors.

assert len(y) == len(x), "Both lists should be the same length"

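The format you find in the heart_scale example is the sparse format, where the dictionary key is the feature index and the dictionary value is the feature value, while the first value is the category.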
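On disk, the same sparse idea becomes one line per sample: the label first, then `index:value` pairs in ascending index order, with feature indices starting at 1 and zero-valued features left out. Here is a small sketch of producing such lines from dense rows. `to_libsvm_line` is a hypothetical helper, not part of the LibSVM API, and the label numbers follow the B, M, C, S, NA numbering used in the code later in this answer.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...', skipping zeros."""
    pairs = [f"{index}:{value}"
             for index, value in enumerate(features, start=1)
             if value != 0]
    return " ".join([str(label)] + pairs)

# (label, [distance_from_beginning, distance_from_end, contains_digit, capitalized])
samples = [
    (1, [1, 10, 1, 0]),   # B
    (2, [10, 1, 0, 1]),   # M
    (3, [2, 3, 0, 1]),    # C
    (4, [23, 2, 0, 0]),   # S
    (5, [12, 0, 0, 1]),   # NA
]
lines = [to_libsvm_line(label, features) for label, features in samples]
print("\n".join(lines))
```

A text file made of these lines is the kind of input the command-line tools expect, and svmutil's svm_read_problem can load it back into the y and x lists shown above.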

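The sparse format is incredibly useful when your feature vectors use a Bag of Words representation.

Since most documents use only a very small subset of the words in the corpus, the resulting matrix will have many feature values that are zero (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary on the order of 100,000 unique words in total, while each individual document will use 100 to 1,000 unique words.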
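To make the sparsity concrete, here is a rough sketch that turns titles into sparse dictionaries in the style of the README example. The vocabulary handling is deliberately simplified (lowercasing and whitespace splitting), the second and third titles are made up for illustration, and `build_vocabulary` and `to_sparse` are hypothetical names.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct lowercase word to a feature index starting at 1."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            vocabulary.setdefault(word, len(vocabulary) + 1)
    return vocabulary

def to_sparse(document, vocabulary):
    """Return {feature_index: count} for the known words in one document."""
    counts = Counter(document.lower().split())
    return {vocabulary[word]: count
            for word, count in counts.items() if word in vocabulary}

documents = [
    "Apple iPhone 5 White 16GB",
    "Apple iPad White",        # made-up title
    "Samsung Galaxy Black",    # made-up title
]
vocabulary = build_vocabulary(documents)
x = [to_sparse(document, vocabulary) for document in documents]
print(len(vocabulary))
print(x[1])
```

Each dictionary only stores the words that actually occur in that title, so its size stays small no matter how large the vocabulary grows.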

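As an example using the feature vector you started with, I trained a basic LibSVM 3.20 model. This code is not meant to be used as is, but it may help show how to create and test a model.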

from collections import namedtuple
# Using namedtuples for descriptive purposes, in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning", "distance_from_end", "contains_digit", "capitalized"])

# Separate up the set of categories, libsvm requires a numerical index so we associate each with an index.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
    # Feature indices in LibSVM's sparse format start at 1; class labels can be any integers, and we number them from 1 as well.
    categories[name] = Category(index + 1, name)
categories

Out[0]: {'B': Category(index=1, name='B'),
   'C': Category(index=3, name='C'),
   'M': Category(index=2, name='M'),
   'NA': Category(index=5, name='NA'),
   'S': Category(index=4, name='S')}

# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")
# We just ignore the header.
header = csv_input_lines[0]

# A list of Feature namedtuples, this will be trained as lists.
features = list()
for line in csv_input_lines[1:]:
    split_values = line.split(',')
    # Create a Feature with the values converted to integers.
    features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))

features

Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
 Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
 Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
 Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
 Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]

# Y is the category index used in training for each Feature. It is a list (order matters) of all the trained indexes.
y = [f.category_index for f in features]
# X is the feature vector; we take every value of the named tuple except the category, which is at index 0.
x = [list(f)[1:] for f in features]

# With LibSVM 3.20 built from source, svmutil is imported directly. Newer pip-installed
# packages may expose it as `libsvm.svmutil` instead (an assumption about your setup).
from svmutil import svm_parameter, svm_problem, svm_train, svm_predict
# Barebones defaults for SVM
param = svm_parameter()
# The (Y,X) parameters should be the train dataset.
prob = svm_problem(y, x)
model=svm_train(prob, param)

# For actual accuracy checking, the (Y,X) parameters should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)

Out[3]: Accuracy = 100% (5/5) (classification)

I hope this example helps. It should not be used for your actual training; it is meant only as an example, because it is inefficient.


Answer 1: 11, accepted, 2015-06-27 22:00:00


Answer 2: 2, 2015-06-27 11:51:56
