
Combine different types of features (Text classification)

I'm working on a text classification task and have run into a problem. I've already selected the 1000 best features using a bag-of-words approach. Now I want to use additional features based on part-of-speech tags, average word length, etc., and then combine all of these features together. How can I achieve this? I'm using the Python, NLTK, and scikit-learn packages. This is my first Python project, so the code may not be very good.

Thanks in advance,

    import nltk
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    from sklearn.feature_extraction.text import TfidfVectorizer
    import os
    import numpy as np
    import random
    import pickle
    from time import time
    from sklearn import metrics

    from nltk.classify.scikitlearn import SklearnClassifier
    from sklearn.naive_bayes import MultinomialNB,BernoulliNB
    from sklearn.linear_model import LogisticRegression,SGDClassifier
    from sklearn.svm import SVC, LinearSVC, NuSVC

    import matplotlib.pyplot as plt

    def intersect(a, b, c, d):
        return list(set(a) & set(b)& set(c)& set(d))

    def find_features(document, feature_list):
        """Bag-of-words features: is each selected word present in the document?"""
        words = set(document)
        features = {}
        for w in feature_list:
            features[w] = (w in words)
        return features


    def benchmark(clf, name, training_set, testing_set):

        print('_' * 80)
        print("Training: ")
        print(clf)
        t0 = time()
        clf.train(training_set)
        train_time = time() - t0
        print("train time: %0.3fs" % train_time)

        t0 = time()
        score = nltk.classify.accuracy(clf, testing_set)*100
        #pred = clf.predict(testing_set)
        test_time = time() - t0

        print("test time:  %0.3fs" % test_time)

        print("accuracy:   %0.3f" % score)
        clf_descr = name
        return clf_descr, score, train_time, test_time

    #print(find_features(corpus.words('fantasy/1077-0_fantasy.txt'), feature_list))
    path = 'c:/data/books-Copy'
    os.chdir(path)
    corpus = CategorizedPlaintextCorpusReader(path, r'.*\.txt',
                                              cat_pattern=r'(\w+)/*')
    # Load a previously pickled feature set (recomputed from scratch below).
    with open(path + "/features_500.pickle", "rb") as featuresets_file:
        featuresets = pickle.load(featuresets_file)

    documents = [(list(corpus.words(fileid)), category)
                 for category in corpus.categories()
                 for fileid in corpus.fileids(category)]

    random.shuffle(documents)

    # For each category, rank its vocabulary by inverse document frequency
    # and keep the highest-scoring terms.
    top_features = []
    tf = TfidfVectorizer(input='filename', analyzer='word',
                         min_df=1, stop_words='english', sublinear_tf=True)

    for category in corpus.categories():
        files = corpus.fileids(category)
        tf.fit_transform(files)
        feature_names = tf.get_feature_names()
        indices = np.argsort(tf.idf_)[::-1]
        top_features.append([feature_names[i] for i in indices[:10000]])

    feature_list = list( set(top_features[0][:500]) | set(top_features[1][:500]) | 
                         set(top_features[2][:500])  | set(top_features[3][:500]) | 
                         set(intersect(top_features[0], top_features[1], top_features[2], top_features[3])))


    featuresets = [(find_features(rev, feature_list), category)
                   for (rev, category) in documents]
    # Non-overlapping train/test split.
    training_set = featuresets[:50]
    testing_set = featuresets[50:]
    results = []
    for clf, name in (
            (SklearnClassifier(MultinomialNB()), "MultinomialNB"),
            (SklearnClassifier(BernoulliNB()), "BernoulliNB"),
            (SklearnClassifier(LogisticRegression()), "LogisticRegression"),
            (SklearnClassifier(SVC()), "SVC"),
            (SklearnClassifier(LinearSVC()), "LinearSVC"),
            (SklearnClassifier(SGDClassifier()), "SGD")):
        print(name)
        results.append(benchmark(clf, name, training_set, testing_set))

    # Transpose the list of (name, score, train_time, test_time) tuples
    # into four parallel lists, then normalise the timings for plotting.
    indices = np.arange(len(results))
    results = [[x[i] for x in results] for i in range(4)]

    clf_names, score, training_time, test_time = results
    training_time = np.array(training_time) / np.max(training_time)
    test_time = np.array(test_time) / np.max(test_time)
    plt.figure(figsize=(12, 8))
    plt.title("Score")
    plt.barh(indices, score, .2, label="score", color='navy')
    plt.barh(indices + .3, training_time, .2, label="training time", color='c')
    plt.barh(indices + .6, test_time, .2, label="test time", color='darkorange')
    plt.yticks(())
    plt.legend(loc='best')
    plt.subplots_adjust(left=.25, top=.95, bottom=.05)

    # Label each bar with its classifier name, then show the finished plot.
    for i, c in zip(indices, clf_names):
        plt.text(-15.6, i, c)
    plt.show()

There is nothing wrong with combining features of different types (in fact, it's generally a good idea for classification tasks). NLTK's API expects the features to come in a dictionary, so you just need to combine your feature collections into a single dictionary.
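As a minimal sketch of what that could look like with the find_features helper from the question (the extra feature names such as avg_word_len and pos(NN) are illustrative choices, not fixed conventions):

    def extra_features(document):
        """Hand-crafted features: average word length and POS-tag counts."""
        features = {}
        features["avg_word_len"] = sum(len(w) for w in document) / max(len(document), 1)
        # Coarse part-of-speech counts; nltk.pos_tag may require
        # nltk.download('averaged_perceptron_tagger') the first time it runs.
        for word, tag in nltk.pos_tag(document):
            key = "pos({})".format(tag)
            features[key] = features.get(key, 0) + 1
        return features

    def combined_features(document, feature_list):
        # Start from the bag-of-words dictionary...
        features = find_features(document, feature_list)
        # ...and merge the extra features into the same dictionary.
        features.update(extra_features(document))
        return features

    featuresets = [(combined_features(rev, feature_list), category)
                   for (rev, category) in documents]

SklearnClassifier runs these dictionaries through a DictVectorizer internally, so boolean, integer, and float values can coexist in the same feature dictionary; the main thing to watch is that the keys of the different feature groups don't collide.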

This is the answer to the question you asked. If there is a problem with your code which you need help with but did not ask about, you should probably start a new question.
