在scikit中实现K邻居分类器和线性SVM - 学习单词感知消歧

Question

我正在尝试使用线性SVM和K邻居分类器来进行词义消歧（WSD）。 以下是我用来训练数据的一段数据：

<corpus lang="English">

<lexelt item="activate.v">


<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is ,  and where I can get one ?  We suspect you had seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly ,  you should n't have to bend your back during general digging ,  although it wo n't lift out the soil and put in a barrow if you need to move it !  If gardening tends to give you backache ,  remember to take plenty of rest periods during the day ,  and never try to lift more than you can easily cope with .  
</context>
</instance>


<instance id="activate.v.bnc.00044852" docsrc="BNC">
<answer instance="activate.v.bnc.00044852" senseid="38201"/>
<answer instance="activate.v.bnc.00044852" senseid="38202"/>
<context>
For neurophysiologists and neuropsychologists ,  the way forward in understanding perception has been to correlate these dimensions of experience with ,  firstly ,  the material properties of the experienced object or event  ( usually regarded as the stimulus )  and ,  secondly ,  the patterns of discharges in the sensory system .  Qualitative Aspects of Experience The quality or modality of the experience depends less upon the quality of energy reaching the nervous system than upon which parts of the sensory system are <head>activated</head> : stimulation of the retinal receptors causes an experience of light ; stimulation of the receptors in the inner ear gives rise to the experience of sound ; and so on . Muller 's  nineteenth - century  doctrine of specific energies  formalized the ordinary observation that different sense organs are sensitive to different physical properties of the world and that when they are stimulated ,  sensations specific to those organs are experienced .  It was proposed that there are endings  ( or receptors )  within the nervous system which are attuned to specific types of energy ,  For example ,  retinal receptors in the eye respond to light energy ,  cochlear endings in the ear to vibrations in the air ,  and so on .  
</context>
</instance>
.....

训练和测试数据之间的区别在于测试数据没有“答案”标签 。 我已经构建了一个字典来存储每个实例的“head”字的邻居，其窗口大小为10. 当一个实例有多个时，我只考虑第一个 。 我还构建了一个记录训练文件中所有词汇表的集合，以便为每个实例计算一个向量。 例如，如果总词汇表是[a，b，c，d，e]，并且一个实例具有单词[a，a，d，d，e]，则该实例的结果向量将是[2,0， 0,2,1] 。 这是我为每个单词构建的字典的一部分：

{
    "activate.v": {
        "activate.v.bnc.00024693": {
            "instanceId": "activate.v.bnc.00024693", 
            "senseId": "38201", 
            "vocab": {
                "although": 1, 
                "back": 1, 
                "bend": 1, 
                "bicycl": 1, 
                "correct": 1, 
                "dig": 1, 
                "general": 1, 
                "handlebar": 1, 
                "hefti": 1, 
                "lever": 1, 
                "nt": 2, 
                "quit": 1, 
                "rear": 1, 
                "spade": 1, 
                "sprung": 1, 
                "step": 1, 
                "type": 1, 
                "use": 1, 
                "wo": 1
            }
        }, 
        "activate.v.bnc.00044852": {
            "instanceId": "activate.v.bnc.00044852", 
            "senseId": "38201", 
            "vocab": {
                "caus": 1, 
                "ear": 1, 
                "energi": 1, 
                "experi": 1, 
                "inner": 1, 
                "light": 1, 
                "nervous": 1, 
                "part": 1, 
                "qualiti": 1, 
                "reach": 1, 
                "receptor": 2, 
                "retin": 1, 
                "sensori": 1, 
                "stimul": 2, 
                "system": 2, 
                "upon": 2
            }
        }, 
        ......

现在，我只需要从scikit提供K邻居分类器和线性SVM的输入 - 学习训练分类器。 但我不确定应该如何为每个构建特征向量和标签。 我的理解是标签应该是“答案”中的实例标签和senseid标签的元组。 但我不确定特征向量。 我应该在“回答”中对来自同一个具有相同实例标签和senseid标签的所有向量进行分组吗？ 但每个单词大约有100个单词和数百个实例，我该怎么处理呢？

另外，vector是一个特性，我需要稍后添加更多功能，例如synset，hypernyms，hyponyms等 。 我该怎么做？

提前致谢！

Answer 1

机器学习问题是一种优化任务，您没有预定义的最佳所有算法，而是使用不同的方法，参数和数据预处理来摸索最佳结果。 所以你绝对是从最简单的任务开始 - 只需要一个单词和几个意义。

但我不确定应该如何为每个构建特征向量和标签。

您可以将这些值作为矢量组件。 枚举矢量字并在每个文本中写入此类字的数字。 如果单词不存在，则输入空值。 我稍微修改了你的例子以澄清这个想法：

vocab_38201= {
            "although": 1, 
            "back": 1, 
            "bend": 1, 
            "bicycl": 1, 
            "correct": 1, 
            "dig": 1, 
            "general": 1, 
            "handlebar": 1, 
            "hefti": 1, 
            "lever": 1, 
            "nt": 2, 
            "quit": 1, 
            "rear": 1, 
            "spade": 1, 
            "sprung": 1, 
            "step": 1, 
            "type": 1, 
            "use": 1, 
            "wo": 1
        }

vocab_38202 = {
            "caus": 1, 
            "ear": 1, 
            "energi": 1, 
            "experi": 1, 
            "inner": 1, 
            "light": 1, 
            "nervous": 1, 
            "part": 1, 
            "qualiti": 1, 
            "reach": 1, 
            "receptor": 2, 
            "retin": 1, 
            "sensori": 1, 
            "stimul": 2, 
            "system": 2, 
            "upon": 2,
            "wo": 1     ### added so they have at least one common word
        }

让我们把它转换为特征向量。 枚举所有单词并标记该单词在词汇表中的次数。

from collections import defaultdict
words = []

def get_components(vect_dict):
    vect_components = defaultdict(int)
    for word, num in vect_dict.items():
        try:
           ind = words.index(word)
        except ValueError:
           ind = len(words)
           words.append(word)
        vect_components[ind] += num
    return vect_components


#  
vect_comps_38201 = get_components(vocab_38201)
vect_comps_38202 = get_components(vocab_38202)

我们看看吧：

>>> print(vect_comps_38201)
defaultdict(<class 'int'>, {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1})

>>> print(vect_comps_38202)
defaultdict(<class 'int'>, {32: 1, 33: 2, 34: 1, 7: 1, 19: 2, 20: 2, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1, 31: 1})

>>> vect_38201=[vect_comps_38201.get(i,0) for i in range(len(words))]
>>> print(vect_38201)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

>>> vect_38202=[vect_comps_38202.get(i,0) for i in range(len(words))]
>>> print(vect_38202)
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1]

这些vect_38201和vect38202是您可以在拟合模型中使用的向量：

from sklearn.svm import SVC
X = [vect_38201, vect_38202]
y = [38201, 38202]
clf = SVC()
clf.fit(X, y)
clf.predict([[0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 1]])

输出：

array([38202])

当然这是一个非常简单的例子，只是展示了这个概念。

你能做些什么来改善它？

标准化矢量坐标。
使用优秀的工具Tf-Idf矢量化器从文本中提取数据特征。
添加更多数据。

祝好运！

Answer 2

下一步 - 实现多维线性分类器。

不幸的是我没有访问这个数据库，所以这有点理论上。 我可以提出这种方法：

将所有数据强制转换为一个CSV文件，如下所示：

SenseId,Word,Text,IsHyponim,Properties,Attribute1,Attribute2, ...
30821,"BNC","For neurophysiologists and ...","Hyponym sometype",1,1
30822,"BNC","Do you know what it is ...","Antonym type",0,1
...

接下来，您可以使用sklearn工具：

import pandas as pd
df.read_csv('file.csv')

from sklearn.feature_extraction import DictVectorizer
enc=DictVectorizer()
X_train_categ = enc.fit_transform(df[['Properties',]].to_dict('records'))

from sklearn.feature_extraction.text import TfidfVectorizer
vec=TfidfVectorizer(min_df=5)  # throw out all terms which present in less than 5 documents - typos and so on
v=vec.fit_transform(df['Text'])

# Join all date together as a sparsed matrix
from scipy.sparse import csr_matrix, hstack
train=hstack( (csr_matrix(df.ix[:, 'Word':'Text']), X_train_categ, v))
y = df['SenseId']

# here you have an matrix with really huge dimensionality - about dozens of thousand columns 
# you may use Ridge regression to deal with it:
from sklearn.linear_model import Ridge
r=Ridge(random_state=241, alpha=1.0)

# prepare test data like training one

更多细节： Ridge ， Ridge Classifier 。

处理高维度问题的其他技术。

使用稀疏特征矩阵进行文本分类的代码示例。

在scikit中实现K邻居分类器和线性SVM - 学习单词感知消歧

问题描述

2 个解决方案

解决方案1
2 2016-06-06 17:13:25

解决方案2
2 2016-06-08 12:47:22

在scikit中实现K邻居分类器和线性SVM - 学习单词感知消歧

问题描述

2 个解决方案

解决方案1 2 2016-06-06 17:13:25

解决方案2 2 2016-06-08 12:47:22

解决方案1
2 2016-06-06 17:13:25

解决方案2
2 2016-06-08 12:47:22