
Implement K Neighbors Classifier and Linear SVM in scikit-learn for word sense disambiguation

I am trying to use a linear SVM and a K Neighbors classifier for word sense disambiguation (WSD). Here is a segment of the data I use for training:

<corpus lang="English">

<lexelt item="activate.v">


<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is ,  and where I can get one ?  We suspect you had seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly ,  you should n't have to bend your back during general digging ,  although it wo n't lift out the soil and put in a barrow if you need to move it !  If gardening tends to give you backache ,  remember to take plenty of rest periods during the day ,  and never try to lift more than you can easily cope with .  
</context>
</instance>


<instance id="activate.v.bnc.00044852" docsrc="BNC">
<answer instance="activate.v.bnc.00044852" senseid="38201"/>
<answer instance="activate.v.bnc.00044852" senseid="38202"/>
<context>
For neurophysiologists and neuropsychologists ,  the way forward in understanding perception has been to correlate these dimensions of experience with ,  firstly ,  the material properties of the experienced object or event  ( usually regarded as the stimulus )  and ,  secondly ,  the patterns of discharges in the sensory system .  Qualitative Aspects of Experience The quality or modality of the experience depends less upon the quality of energy reaching the nervous system than upon which parts of the sensory system are <head>activated</head> : stimulation of the retinal receptors causes an experience of light ; stimulation of the receptors in the inner ear gives rise to the experience of sound ; and so on . Muller 's  nineteenth - century  doctrine of specific energies  formalized the ordinary observation that different sense organs are sensitive to different physical properties of the world and that when they are stimulated ,  sensations specific to those organs are experienced .  It was proposed that there are endings  ( or receptors )  within the nervous system which are attuned to specific types of energy ,  For example ,  retinal receptors in the eye respond to light energy ,  cochlear endings in the ear to vibrations in the air ,  and so on .  
</context>
</instance>
.....

The difference between the training and test data is that the test data has no "answer" tags. I have built a dictionary that stores, for each instance, the neighbouring words of the "head" word within a window of size 10. When an instance has more than one senseid, I only consider the first one. I have also built a set that records the whole vocabulary of the training file, so that I can compute a vector for each instance. For example, if the total vocabulary is [a, b, c, d, e] and an instance contains the words [a, a, d, d, e], the resulting vector for that instance is [2, 0, 0, 2, 1] (a sketch of these two preprocessing steps follows the dictionary excerpt below). Here is part of the dictionary I built for each word:

{
    "activate.v": {
        "activate.v.bnc.00024693": {
            "instanceId": "activate.v.bnc.00024693", 
            "senseId": "38201", 
            "vocab": {
                "although": 1, 
                "back": 1, 
                "bend": 1, 
                "bicycl": 1, 
                "correct": 1, 
                "dig": 1, 
                "general": 1, 
                "handlebar": 1, 
                "hefti": 1, 
                "lever": 1, 
                "nt": 2, 
                "quit": 1, 
                "rear": 1, 
                "spade": 1, 
                "sprung": 1, 
                "step": 1, 
                "type": 1, 
                "use": 1, 
                "wo": 1
            }
        }, 
        "activate.v.bnc.00044852": {
            "instanceId": "activate.v.bnc.00044852", 
            "senseId": "38201", 
            "vocab": {
                "caus": 1, 
                "ear": 1, 
                "energi": 1, 
                "experi": 1, 
                "inner": 1, 
                "light": 1, 
                "nervous": 1, 
                "part": 1, 
                "qualiti": 1, 
                "reach": 1, 
                "receptor": 2, 
                "retin": 1, 
                "sensori": 1, 
                "stimul": 2, 
                "system": 2, 
                "upon": 2
            }
        }, 
        ......
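
For reference, here is a minimal sketch of the two preprocessing steps described above (the 10-word window around the head word, and mapping an instance onto the total vocabulary). The tokenization and function names are simplified for illustration and leave out the stemming and stop-word removal visible in the dictionary excerpt:

import re
from collections import Counter

def window_around_head(context_text, size=10):
    # take `size` tokens on each side of the <head>...</head> word
    tokens = re.split(r'\s+', context_text.strip())
    head_idx = next(i for i, t in enumerate(tokens) if t.startswith('<head>'))
    left = tokens[max(0, head_idx - size):head_idx]
    right = tokens[head_idx + 1:head_idx + 1 + size]
    return [t.lower() for t in left + right]

def count_vector(instance_words, total_vocab):
    # map a bag of words onto the fixed, ordered total vocabulary
    counts = Counter(instance_words)
    return [counts.get(w, 0) for w in total_vocab]

# the toy example from the text: vocabulary [a, b, c, d, e], instance [a, a, d, d, e]
print(count_vector(['a', 'a', 'd', 'd', 'e'], ['a', 'b', 'c', 'd', 'e']))   # [2, 0, 0, 2, 1]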

Now I just need to feed the input to the K Neighbors classifier and the linear SVM from scikit-learn to train the classifiers. But I am not sure how I should build the feature vectors and labels for each instance. My understanding is that the label should be a tuple of the instance tag and the senseid tag from the "answer". But I am not sure about the feature vectors. Should I group together all the vectors from instances that share the same instance tag and senseid tag in the "answer"? But there are about 100 words and hundreds of instances for each word, so how am I supposed to deal with that?

Also, this vector is only one feature; I will need to add more features later, for example synsets, hypernyms, hyponyms, etc. How should I do that?

Thanks in advance!

A machine learning problem is an optimization task: there is no predefined algorithm that is best for everything, so you reach the best result by trying different approaches, parameters and data preprocessing. You are therefore absolutely right to start with the simplest task: a single word and just a few senses.

But I am not sure how I should build the feature vectors and labels for each instance.

You can use these counts as vector components. Enumerate the vocabulary words and, for each text, write down how many times each word occurs; if a word is not present, put a zero. I have slightly modified your example to clarify the idea:

vocab_38201= {
            "although": 1, 
            "back": 1, 
            "bend": 1, 
            "bicycl": 1, 
            "correct": 1, 
            "dig": 1, 
            "general": 1, 
            "handlebar": 1, 
            "hefti": 1, 
            "lever": 1, 
            "nt": 2, 
            "quit": 1, 
            "rear": 1, 
            "spade": 1, 
            "sprung": 1, 
            "step": 1, 
            "type": 1, 
            "use": 1, 
            "wo": 1
        }

vocab_38202 = {
            "caus": 1, 
            "ear": 1, 
            "energi": 1, 
            "experi": 1, 
            "inner": 1, 
            "light": 1, 
            "nervous": 1, 
            "part": 1, 
            "qualiti": 1, 
            "reach": 1, 
            "receptor": 2, 
            "retin": 1, 
            "sensori": 1, 
            "stimul": 2, 
            "system": 2, 
            "upon": 2,
            "wo": 1     ### added so they have at least one common word
        }

讓我們把它轉換為特征向量。 枚舉所有單詞並標記該單詞在詞匯表中的次數。

from collections import defaultdict
words = []

def get_components(vect_dict):
    vect_components = defaultdict(int)
    for word, num in vect_dict.items():
    try:
        ind = words.index(word)
    except ValueError:
        ind = len(words)
        words.append(word)
        vect_components[ind] += num
    return vect_components


# build index -> count mappings for both vocabularies, sharing the global `words` index
vect_comps_38201 = get_components(vocab_38201)
vect_comps_38202 = get_components(vocab_38202)

Let's have a look:

>>> print(vect_comps_38201)
defaultdict(<class 'int'>, {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1})

>>> print(vect_comps_38202)
defaultdict(<class 'int'>, {32: 1, 33: 2, 34: 1, 7: 1, 19: 2, 20: 2, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1, 31: 1})

>>> vect_38201=[vect_comps_38201.get(i,0) for i in range(len(words))]
>>> print(vect_38201)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

>>> vect_38202=[vect_comps_38202.get(i,0) for i in range(len(words))]
>>> print(vect_38202)
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1]

These vect_38201 and vect_38202 are the vectors you can use to fit the model:

from sklearn.svm import SVC
X = [vect_38201, vect_38202]
y = [38201, 38202]
clf = SVC()
clf.fit(X, y)
clf.predict([[0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 1]])

Output:

array([38202])

Of course, this is a very simplistic example that just shows the concept.
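
Since the question asks specifically about a K Neighbors classifier and a linear SVM, the same toy X and y can be passed to those scikit-learn estimators as well; a minimal sketch (the hyperparameters are illustrative defaults, not tuned values):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# X and y as defined in the SVC example above
knn = KNeighborsClassifier(n_neighbors=1)   # only two training samples, so k=1
knn.fit(X, y)
print(knn.predict([vect_38202]))            # recovers the training label: array([38202])

lin_svm = LinearSVC()                       # linear SVM on the same count vectors
lin_svm.fit(X, y)
print(lin_svm.predict([vect_38201]))        # expected: array([38201])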

What can you do to improve it?

  1. Normalize the vector coordinates.

  2. Use the excellent Tf-Idf vectorizer to extract features from the text (a short sketch of points 1 and 2 follows this list).

  3. Add more data.
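
A minimal sketch of points 1 and 2, reusing the toy X from the example above and a couple of hypothetical raw context strings:

from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. L2-normalize the hand-built count vectors so that context length
#    does not dominate the distances / decision function
X_normalized = normalize(X, norm='l2')

# 2. or let TfidfVectorizer build and weight the feature matrix directly
#    from the raw context strings (placeholder texts here)
contexts = [
    "Do you know what it is , and where I can get one ? ...",
    "For neurophysiologists and neuropsychologists , the way forward ...",
]
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(contexts)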

Good luck!

Next step: implementing a multidimensional linear classifier.

Unfortunately, I do not have access to this dataset, so this is somewhat theoretical. I can propose the following approach:

Merge all the data into one CSV file, like this:

SenseId,Word,Text,IsHyponim,Properties,Attribute1,Attribute2, ...
30821,"BNC","For neurophysiologists and ...","Hyponym sometype",1,1
30822,"BNC","Do you know what it is ...","Antonym type",0,1
...

Next, you can use the sklearn tools:

import pandas as pd
df = pd.read_csv('file.csv')

from sklearn.feature_extraction import DictVectorizer
enc=DictVectorizer()
X_train_categ = enc.fit_transform(df[['Properties',]].to_dict('records'))

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(min_df=5)  # throw out terms that appear in fewer than 5 documents - typos and so on
v=vec.fit_transform(df['Text'])

# Join all the features together into one sparse matrix
from scipy.sparse import csr_matrix, hstack
# use only the numeric attribute columns here; 'Text' and 'Properties' are already
# covered by the TfidfVectorizer and DictVectorizer outputs above
train = hstack((csr_matrix(df.loc[:, 'Attribute1':].astype(float)), X_train_categ, v))
y = df['SenseId']

# here you get a matrix with a really huge dimensionality - dozens of thousands of columns;
# you may use Ridge regression to deal with it:
from sklearn.linear_model import Ridge
r = Ridge(random_state=241, alpha=1.0)
r.fit(train, y)

# prepare the test data in the same way as the training data
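
Note that the sense IDs are categorical labels rather than continuous values, so RidgeClassifier is usually a more natural choice than plain Ridge regression for this task; a self-contained toy sketch:

from scipy.sparse import csr_matrix
from sklearn.linear_model import RidgeClassifier

X_train = csr_matrix([[1, 0, 2], [0, 3, 0], [2, 1, 0]])   # toy sparse feature rows
y_train = [38201, 38202, 38201]                           # sense IDs as class labels

clf = RidgeClassifier(alpha=1.0, random_state=241)
clf.fit(X_train, y_train)
print(clf.predict(csr_matrix([[1, 0, 1]])))               # predicted sense ID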

More details: Ridge and RidgeClassifier in the scikit-learn documentation.

Other techniques for dealing with high dimensionality.

A code example of text classification using a sparse feature matrix.
