Improve the accuracy for multi-label classification (Scikit-learn, Keras)

I am training machine learning models to assign specific labels to paragraphs that describe activities. In my database, each description paragraph (X) has several corresponding labels (Y) associated with it. I would like to improve the classification accuracy.

I have built several machine learning models with Scikit-learn (e.g., SVC, DecisionTreeClassifier, KNeighborsClassifier, RadiusNeighborsClassifier, ExtraTreesClassifier, RandomForestClassifier, MLPClassifier, RidgeClassifierCV) and a neural network model with Keras. The best accuracy (a harsh, exact-match metric) I can get is 47%, using OneVsRestClassifier(SGDClassifier).

print(X)
0        Contribution to METU HS Ankara Lab Protocols ...
1        Attend the MakerFaire in Hannover to demonstr...
2        Organize a "Biotech Day" and present the proj...
3        Contact and connect with Community Labs in Eu...
4        Invite "Technik Garage," a German Community L...
5        Present the project to the biotechnology comp...
6        Visit one of Europe's largest detergent plant...
...

print(y2)
0                                       [Community Event]
1                 [Project Presentation, Community Event]
2               [Project Presentation, Teaching Activity]
3          [Conference/Panel Discussion, Consult Experts]
4          [Conference/Panel Discussion, Consult Experts]
5       [Conference/Panel Discussion, Project Presenta...
6                                       [Consult Experts]
...

...

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb_y2 = mlb.fit_transform(y2)  # binarize the label lists into a multi-hot indicator matrix
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, mlb_y2, test_size=0.2, random_state=52)

Scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

pipe = Pipeline(steps=[
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', OneVsRestClassifier(SGDClassifier(
        loss='hinge', alpha=0.00026, penalty='elasticnet', max_iter=2000,
        tol=0.0008, learning_rate='adaptive', eta0=0.12)))
])
pipe.fit(X_train, y_train) 
print("test model score: %.3f" % pipe.score(X_test, y_test))
print("train model score: %.3f" % pipe.score(X_train, y_train))
test model score: 0.478
train model score: 0.801 (Overfitting exists! I adjusted the penalty and alpha terms, but that doesn't improve things much. I don't know whether there is any other way to do the regularization.)
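
A minimal sketch of how the penalty and alpha terms can be searched systematically over the whole pipeline (the grid values below are illustrative assumptions, not tuned settings):

from sklearn.model_selection import GridSearchCV

# parameter names reach the inner SGDClassifier through
# the pipeline step 'classifier' and the OneVsRest 'estimator'
param_grid = {
    'classifier__estimator__alpha': [1e-5, 1e-4, 2.6e-4, 1e-3],
    'classifier__estimator__penalty': ['l2', 'elasticnet'],
    'classifier__estimator__l1_ratio': [0.15, 0.3, 0.5],
}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

GridSearchCV scores with the pipeline's own score method, which for OneVsRestClassifier is the same exact-match accuracy reported above.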

Keras:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=300, lower=True)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
vocab_size = len(tokenizer.word_index) + 1
x = pad_sequences(sequences, padding='post', maxlen=80)
# split the padded sequences (the raw-text X_train/X_test above cannot be fed to Keras)
X_train_seq, X_test_seq, y_train, y_test = train_test_split(x, mlb_y2, test_size=0.2, random_state=52)

from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, Flatten, GlobalMaxPool1D, Dropout, Conv1D, LSTM, SpatialDropout1D
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from keras.losses import binary_crossentropy
from keras.optimizers import Adam
import sklearn

filter_length = 1000  # number of Conv1D filters

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=70, input_length=80))
model.add(Dropout(0.1))
model.add(Conv1D(filter_length, 3, padding='valid', activation='relu', strides=1))
model.add(GlobalMaxPool1D())
#model.add(SpatialDropout1D(0.1))
#model.add(LSTM(100, dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(len(mlb.classes_)))  # one sigmoid output per label
model.add(Activation('sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['categorical_accuracy'])

callbacks = [ReduceLROnPlateau(), EarlyStopping(patience=4),
             ModelCheckpoint(filepath='model-conv1d.h5', save_best_only=True)]

history = model.fit(X_train_seq, y_train, epochs=80, batch_size=500,
                    validation_split=0.1, verbose=2, callbacks=callbacks)

from keras import models
cnn_model = models.load_model('model-conv1d.h5')
from sklearn.metrics import accuracy_score
y_pred = cnn_model.predict(X_test_seq)
accuracy_score(y_test,y_pred.round())

Out: 0.4405555555555556 (I think the neural network model has more room for improvement, but I'm not sure how to achieve that.)

I would like the accuracy to reach at least 60%. Could you give me some advice on improving my Scikit-learn and Keras model code?

More specifically:

1. Is there any way to improve OneVsRestClassifier(SGDClassifier)?
2. Is there any way to improve my convolutional neural network, or should I use some form of recurrent neural network? (I tried a simple RNN, but it didn't work well.)

PS: Given the way I compute accuracy, if the model outputs [0, 0, 0, 1, 0, 1] (y_pred) and the correct output is [0, 0, 0, 1, 0, 0] (y_test), my accuracy is 0 rather than 5/6, right?
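
For concreteness, here is that example checked with scikit-learn's metrics (the array names are just for this illustration). accuracy_score on indicator matrices is exact-match (subset) accuracy, while a Hamming-style score gives the per-label 5/6:

import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

y_test_ex = np.array([[0, 0, 0, 1, 0, 0]])
y_pred_ex = np.array([[0, 0, 0, 1, 0, 1]])

print(accuracy_score(y_test_ex, y_pred_ex))    # 0.0 -- all 6 labels must match exactly
print(1 - hamming_loss(y_test_ex, y_pred_ex))  # 0.8333... -- 5 of 6 labels correct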

This is a long question. Thank you all very much!

If you have a bunch of weak classifiers, you can try using a boosting technique (e.g., AdaBoost) to blend them into one strong classifier. One thing to keep in mind:

If you don't have enough training data, you may end up with an overfitted model.
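
A minimal sketch of that suggestion on the question's text pipeline: AdaBoostClassifier does not accept multi-label indicator targets directly, so it is wrapped in OneVsRestClassifier here (reusing the vectorizer steps and n_estimators=200 are illustrative assumptions, not tuned choices):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# one boosted ensemble per label via one-vs-rest
boosted = Pipeline(steps=[
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', OneVsRestClassifier(AdaBoostClassifier(n_estimators=200))),
])
boosted.fit(X_train, y_train)
print("test model score: %.3f" % boosted.score(X_test, y_test))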
