How to use string kernels in scikit-learn?
I am trying to generate a string kernel that feeds a support vector classifier. I tried it with a function that calculates the kernel, something like this:
import numpy as np
import scipy  # editdistance(s, t) is assumed to be defined elsewhere

def stringkernel(K, G):
    R = np.zeros((len(K), len(G)))
    for a in range(len(K)):
        for b in range(len(G)):
            R[a][b] = scipy.exp(editdistance(K[a], G[b]) ** 2)
    return R
And when I pass it to SVC as a parameter, I get
clf = svm.SVC(kernel=stringkernel)
clf.fit(data, target)
ValueError: could not convert string to float: photography
where my data is a list of strings and the target is the corresponding class each string belongs to. I have reviewed some questions on Stack Overflow regarding this issue, but I think a bag-of-words representation is not appropriate for this case.
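For example, the data and target look roughly like this (simplified; the class labels are made up, but the error message above does mention "photography"):

data = ["photography", "gardening", "cooking"]   # raw strings
target = ["arts", "outdoors", "home"]             # class each string belongs to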
This is a limitation in scikit-learn that has proved hard to get rid of. You can try this workaround: represent the strings as feature vectors with only one feature, which is really just an index into a table of strings.
>>> data = ["foo", "bar", "baz"]
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])
Redefine the string kernel function to work on this representation:
>>> def string_kernel(X, Y):
...     R = np.zeros((len(X), len(Y)))
...     for i, x in enumerate(X):
...         for j, y in enumerate(Y):
...             # look up the actual strings behind the pseudo-feature indices
...             a = data[int(x[0])]
...             b = data[int(y[0])]
...             # simplest kernel ever: do the first characters match?
...             R[i, j] = a[0] == b[0]
...     return R
...
>>> clf = SVC(kernel=string_kernel)
>>> clf.fit(X, ['no', 'yes', 'yes'])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel=<function string_kernel at 0x7f5988f0bde8>, max_iter=-1,
  probability=False, random_state=None, shrinking=True, tol=0.001,
  verbose=False)
The downside to this is that to classify new samples, you have to add them to data, then construct new pseudo-feature vectors for them.
>>> data.extend(["bla", "fool"])
>>> clf.predict([[3], [4]])
array(['yes', 'no'],
      dtype='|S3')
(You can get around this by doing more interpretation of your pseudo-features, e.g., looking into a different table for i >= len(X_train). But it's still cumbersome.)
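A minimal sketch of that idea, assuming the training strings live in a train_data table and the unseen strings in a separate test_data table (both names are illustrative, not part of the original answer): indices below len(train_data) are resolved against the training table, anything above it against the test table.

import numpy as np

train_data = ["foo", "bar", "baz"]
test_data = ["bla", "fool"]

def lookup(index):
    # pseudo-feature indices past the training set refer to the test table
    if index < len(train_data):
        return train_data[index]
    return test_data[index - len(train_data)]

def string_kernel2(X, Y):
    R = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            a = lookup(int(x[0]))
            b = lookup(int(y[0]))
            # same toy kernel as above: do the first characters match?
            R[i, j] = a[0] == b[0]
    return R

With this, fitting on indices 0-2 and predicting on [[3], [4]] works without mutating a single global list, though the index bookkeeping is still on you.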
This is an ugly hack, but it works (it's slightly less ugly for clustering, because there the dataset doesn't change after fit). Speaking on behalf of the scikit-learn developers, I say a patch to fix this properly is welcome.
I think the shogun library may be the solution here; it is also free and open source. I suggest looking at these examples: https://github.com/shogun-toolbox/shogun/tree/develop/src/shogun/kernel/string