簡體   English   中英

如何在管道中重新取樣文本(不平衡組)?

[英]How to resample text (imbalanced groups) in a pipeline?

我正在嘗試使用MultinomialNB進行一些文本分類,但我遇到了問題,因為我的數據不平衡。 (為簡單起見,下面是一些示例數據。實際上,我的數據要大得多。)我正在嘗試使用過采樣對數據進行重新采樣,我希望將其構建到此管道中。

下面的管道工作正常,沒有過采樣,但在現實生活中我的數據再次需要它。 這是非常不平衡的。

使用當前代碼,我不斷收到錯誤:“TypeError:所有中間步驟應該是變換器並實現適合和轉換。”

如何將RandomOverSampler構建到此管道中?

data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'], 
    ['small fruits', 'grapes']]

df = pd.DataFrame(data,columns=['Description','Type'])  

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()), 
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

print('Score:',text_clf.score(X_test, y_test))

您應該使用在實施管道imblearn包,而不是從一個sklearn 例如,這段代碼運行良好:

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline


data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
    ['small fruits', 'grapes']]

df = pd.DataFrame(data, columns=['Description','Type'])

X_train, X_test, y_train, y_test = train_test_split(df['Description'],
    df['Type'], random_state=0)

text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

print('Score:',text_clf.score(X_test, y_test))

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM