
How to use string kernels in scikit-learn?

I am trying to build a string kernel that feeds a support vector classifier. I tried it with a function that computes the kernel, something like this:

import numpy as np

def stringkernel(K, G):
    # editdistance is assumed to be an edit-distance function defined elsewhere
    R = np.zeros((len(K), len(G)))
    for a in range(len(K)):
        for b in range(len(G)):
            R[a, b] = np.exp(editdistance(K[a], G[b]) ** 2)
    return R

And when I pass it to SVC as a parameter, I get:

clf = svm.SVC(kernel=stringkernel)
clf.fit(data, target)

ValueError: could not convert string to float: photography

where my data is a list of strings and the target is the corresponding class each string belongs to. I have reviewed some questions on Stack Overflow regarding this issue, but I think a bag-of-words representation is not appropriate for this case.

This is a limitation in scikit-learn that has proved hard to get rid of. You can try this workaround: represent the strings as feature vectors with only one feature, which is really just an index into the table of strings.

>>> import numpy as np
>>> data = ["foo", "bar", "baz"]
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])

Redefine the string kernel function to work on this representation:

>>> def string_kernel(X, Y):
...     # X and Y hold pseudo-features: indices into the string table `data`
...     R = np.zeros((len(X), len(Y)))
...     for a, x in enumerate(X):
...         for b, y in enumerate(Y):
...             i = int(x[0])
...             j = int(y[0])
...             # simplest kernel ever: do the first characters match?
...             R[a, b] = float(data[i][0] == data[j][0])
...     return R
... 
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel=string_kernel)
>>> clf.fit(X, ['no', 'yes', 'yes'])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel=<function string_kernel at 0x7f5988f0bde8>, max_iter=-1,
  probability=False, random_state=None, shrinking=True, tol=0.001,
  verbose=False)

The downside of this is that to classify new samples, you have to add them to data, then construct new pseudo-feature vectors for them.

>>> data.extend(["bla", "fool"])
>>> clf.predict([[3], [4]])
array(['yes', 'no'], 
      dtype='|S3')

(You can get around this by doing more interpretation of your pseudo-features, e.g., looking into a different table for i >= len(X_train). But it's still cumbersome.)
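For illustration, here is a minimal sketch of that interpretation trick; the names train_data, test_data, and lookup are invented for this example. Pseudo-feature indices below len(train_data) resolve into the training table, and the rest into a separate test-time table, so nothing has to be appended to the training data:

import numpy as np

train_data = ["foo", "bar", "baz"]  # hypothetical training string table
test_data = ["bla", "fool"]         # hypothetical test-time string table

def lookup(i):
    # Resolve a pseudo-feature index into the appropriate string table.
    if i < len(train_data):
        return train_data[i]
    return test_data[i - len(train_data)]

def string_kernel(X, Y):
    R = np.zeros((len(X), len(Y)))
    for a, x in enumerate(X):
        for b, y in enumerate(Y):
            # same toy kernel as above: do the first characters match?
            R[a, b] = float(lookup(int(x[0]))[0] == lookup(int(y[0]))[0])
    return R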

This is an ugly hack, but it works (it's slightly less ugly for clustering, because there the dataset doesn't change after fit). Speaking on behalf of the scikit-learn developers, I'd say a patch to fix this properly is welcome.
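For completeness, scikit-learn also accepts kernel='precomputed', which sidesteps the float-validation problem entirely: compute the Gram matrix over your strings yourself and pass it to fit and predict. A minimal sketch, assuming a user-supplied editdistance function as in the question (stubbed here with a trivial placeholder):

import numpy as np
from sklearn.svm import SVC

def editdistance(a, b):
    # Placeholder only; substitute a real edit-distance implementation.
    return abs(len(a) - len(b))

def gram(A, B):
    # Similarity that decays with edit distance (unlike exp(d ** 2), which grows).
    return np.array([[np.exp(-editdistance(a, b)) for b in B] for a in A])

train, target = ["foo", "bar", "baz"], ['no', 'yes', 'yes']
clf = SVC(kernel='precomputed')
clf.fit(gram(train, train), target)     # Gram matrix of shape (n_train, n_train)

test = ["bla", "fool"]
print(clf.predict(gram(test, train)))   # Gram matrix of shape (n_test, n_train)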

I think the Shogun library might be the solution here; it is also free and open source. I suggest looking at this example: https://github.com/shogun-toolbox/shogun/tree/develop/src/shogun/kernel/string
