
Neural networks for Text Classification

I am trying to train a text classification model. I have a large labeled dataset. I have tried several scikit-learn classifiers (naive Bayes, KNeighborsClassifier, RandomForest, etc.), but I cannot get accuracy above 30%. How can I use neural networks for text classification? Here is the approach I have used so far:

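   from pandas import read_csv
   from sklearn import preprocessing
   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.model_selection import train_test_split
   from sklearn.naive_bayes import MultinomialNB

   # `filename` and `tokenize` are defined elsewhere (not shown in the question).
   df = read_csv(filename, sep="|", na_values=[" "]).fillna(" ")
   le = preprocessing.LabelEncoder()
   target = le.fit_transform(df['label'])

   vectorizer = TfidfVectorizer(sublinear_tf=True,
                                max_df=0.3,
                                min_df=100,
                                lowercase=True,
                                stop_words='english',
                                max_features=20000,
                                tokenizer=tokenize,
                                ngram_range=(1, 4)
                                )

   train = vectorizer.fit_transform(df['data'])
   # Hold out 5000 documents for testing.
   X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=5000, random_state=0)
   clf = MultinomialNB(alpha=.1)
   clf.fit(X_train, y_train)
   pred = clf.predict(X_test)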

My dataset contains about 300k documents, and the vectorizer can produce up to 50k features. I have even tried chi-square feature selection to reduce the number of features to 5k, but accuracy still does not improve much.
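
For reference, a minimal sketch of what such a chi-square reduction might look like. The question does not say how it was done, so the use of scikit-learn's `SelectKBest` with `chi2` is an assumption, and the toy documents and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy stand-ins for df['data'] and the encoded labels (hypothetical data).
docs = [
    "printer jam in office",
    "printer out of toner",
    "network cable unplugged",
    "network switch down",
]
labels = [0, 0, 1, 1]

X = TfidfVectorizer().fit_transform(docs)

# Keep the k features with the highest chi-square score against the labels.
# The question mentions 5k features; k=3 is used here only because the
# toy vocabulary is tiny.
selector = SelectKBest(chi2, k=3)
X_reduced = selector.fit_transform(X, labels)
```

On real data the selector would normally be fitted on the training split only, so that the held-out documents do not influence which features are kept.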

Nature of the data: the documents are sets of comments and notes on an incident. The labels are high-level categories for the incidents. As expected, the comments and notes are subject to human error and misspellings.

You need to improve the quality of your features. I suggest you form a new question around how to design features for this problem before dealing with the classifier algorithm. Given the poor accuracy you report across several methods, and your description of the data, the features are the weakest point and the one you should address first.
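
As one possible starting point (a sketch, not a tested recipe for this dataset): since the notes contain misspellings, character n-gram TF-IDF features may be more forgiving than word n-grams, and scikit-learn's `MLPClassifier` is a small feed-forward neural network that can be trained on them. The toy documents, labels, and hyperparameters below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy incident notes, misspelled on purpose.
docs = [
    "printer jammed again", "printr jam on floor 2", "printer out of toner",
    "network down in lab", "netwrk cable unplugged", "network switch failed",
]
labels = ["hardware", "hardware", "hardware", "network", "network", "network"]

# Character n-grams within word boundaries: "printr" and "printer" share
# most of their n-grams, so a misspelling still lands near the right class.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), sublinear_tf=True),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
model.fit(docs, labels)

# Classify an unseen, misspelled note.
pred = model.predict(["pritner jam"])
```

Wrapping the vectorizer and classifier in one pipeline means the vectorizer is fitted only on the data passed to `fit`, which keeps test documents out of the vocabulary. Whether this helps on 300k real documents is something only an experiment can show.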

