Classification with n-grams

Question

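I want to use sklearn classifiers with n-gram features. I also want to run cross-validation to find the best n-gram order. However, I'm a bit stuck on how to fit all the pieces together.

For now, I have the following code: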

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

text = ... # This is the input text. A list of strings
labels = ... # These are the labels of each sentence
# Find the optimal order of the ngrams by cross-validation
scores = pd.Series(index=range(1,6), dtype=float)
folds = KFold(n_splits=3)

for n in range(1,6):
    count_vect = CountVectorizer(ngram_range=(n,n), stop_words='english')
    X = count_vect.fit_transform(text)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
    clf = MultinomialNB()
    score = cross_val_score(clf, X_train, y_train, cv=folds, n_jobs=-1)
    scores.loc[n] = np.mean(score)

# Evaluate the classifier using the best order found
order = scores.idxmax()
count_vect = CountVectorizer(ngram_range=(order,order), stop_words='english')
X = count_vect.fit_transform(text)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
clf = MultinomialNB()
clf = clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print('Accuracy is {}'.format(acc))

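However, I feel this is the wrong way to do it, because I create a train-test split in every loop iteration.

If I do the train-test split beforehand and apply CountVectorizer to each part separately, the two parts end up with different shapes, which causes problems when calling clf.fit and clf.score.

How can I solve this?

EDIT: If I try to build a vocabulary first, I still have to build several vocabularies, since the vocabulary for unigrams is different from the one for bigrams, and so on.

To give an example: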

import nltk  # needed for nltk.ngrams below

# unigram vocab
vocab = set()
for sentence in text:
    for word in sentence:
        if word not in vocab:
            vocab.add(word)
len(vocab) # 47291

# bigram vocab
vocab = set()
for sentence in text:
    bigrams = nltk.ngrams(sentence, 2)
    for bigram in bigrams:
        if bigram not in vocab:
            vocab.add(bigram)
len(vocab) # 326044

This leads me back to the same problem: I need to apply CountVectorizer separately for each n-gram size.

Answer 1

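You need to set the vocabulary parameter first. You have to supply the whole vocabulary somehow, otherwise the dimensions can never match (obviously). If you do the train/test split first and fit the vectorizer on each part separately, one set may contain words that do not occur in the other, which gives you mismatched dimensions.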
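To make the mismatch concrete, here is a minimal sketch. The toy documents and the variable names are my own, made up for illustration; only CountVectorizer and its vocabulary parameter come from sklearn. It fits a separate vectorizer on each part, then repeats the transform with one shared vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for illustration only.
train_docs = ["the cat sat on the mat", "the dog barked"]
test_docs = ["a bird sang", "the cat slept"]

# Fitting a separate vectorizer on each part yields different column counts,
# because each vectorizer only knows the words of its own part.
X_train_sep = CountVectorizer().fit_transform(train_docs)
X_test_sep = CountVectorizer().fit_transform(test_docs)
print(X_train_sep.shape, X_test_sep.shape)

# Learn one vocabulary from all documents and reuse it for both parts.
shared_vocab = CountVectorizer().fit(train_docs + test_docs).vocabulary_
vect = CountVectorizer(vocabulary=shared_vocab)
X_train = vect.transform(train_docs)
X_test = vect.transform(test_docs)
print(X_train.shape, X_test.shape)
```

With the shared vocabulary both matrices have the same number of columns, so clf.fit and clf.score no longer complain.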

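The documentation says:

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.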
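A quick way to see this for yourself, as a sketch on two made-up sentences (the data and the sizes dict are assumptions for illustration): for every n-gram order, the number of columns equals the size of the vocabulary that was learned from the data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for illustration only.
docs = ["the cat sat on the mat", "the dog sat on the log"]

sizes = {}
for n in (1, 2, 3):
    vect = CountVectorizer(ngram_range=(n, n))
    X = vect.fit_transform(docs)
    # One column per distinct n-gram found in the data.
    sizes[n] = (X.shape[1], len(vect.vocabulary_))
    print(n, sizes[n])
```

Since each order produces its own set of terms, each order also needs its own vocabulary.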

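Further down, you will find the description of vocabulary.

vocabulary : Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.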
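Putting it together, something along these lines should work. This is a rough sketch, not a tested solution for your data: the toy corpus, the helper vocab_for_order, the range of orders and cv=2 are all my own assumptions. The split is done once on the raw text, and for every order a fixed vocabulary is passed in, so the training and test matrices always have the same columns.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, invented for illustration; replace with the real text and labels.
text = [
    "what a great fun movie", "really great acting and plot",
    "great fun and good acting", "a good movie with great plot",
    "really good fun to watch", "good plot and great fun",
    "what a dull boring movie", "really dull acting and plot",
    "boring plot and bad acting", "a bad movie with dull plot",
    "really bad and boring to watch", "bad plot and dull acting",
]
labels = [1] * 6 + [0] * 6


def vocab_for_order(docs, n):
    """Hypothetical helper: learn the order-n vocabulary from all documents."""
    return CountVectorizer(ngram_range=(n, n)).fit(docs).vocabulary_


# Split the raw text once, before any vectorizing.
text_train, text_test, y_train, y_test = train_test_split(
    text, labels, test_size=0.33, random_state=42, stratify=labels)

# Cross-validate each n-gram order on the training part only.
scores = {}
for n in range(1, 4):
    vect = CountVectorizer(ngram_range=(n, n),
                           vocabulary=vocab_for_order(text, n))
    X_train = vect.transform(text_train)
    scores[n] = np.mean(cross_val_score(MultinomialNB(), X_train, y_train, cv=2))

# Evaluate on the held-out part with the best order found.
best_n = max(scores, key=scores.get)
vect = CountVectorizer(ngram_range=(best_n, best_n),
                       vocabulary=vocab_for_order(text, best_n))
clf = MultinomialNB().fit(vect.transform(text_train), y_train)
acc = clf.score(vect.transform(text_test), y_test)
print('Best order: {}, accuracy: {}'.format(best_n, acc))
```

Note that the vocabulary here is built from the whole corpus, so the term list also includes n-grams that only occur in the test part; no labels are involved. If you would rather avoid that, fitting the vectorizer on the training part and only calling transform on the test part also keeps the columns aligned.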

Solution 1
1 accepted 2017-06-02 17:21:22
