How to fix the ValueError: n_splits=10 error in sklearn (NLP)

This is my first attempt at multi-class classification and my first time using scikit-learn. I found this code online and tried to adapt it to my data.
My data looks like this:

id                      Text                                           Tags
----------------------------------------------------------------------------
1    Tears made her vision blur again                                  blue
2    She looked away, outside, at the blur of snow as he continued.    blue
3    Mr. Green, you are wanted on the phone                            green
4    I prefer oranges to apples                                        orange
5    Tom drank his orange juice                                        black

Here is my code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

df = pd.read_csv('./dataSet03.csv')
col = ['Text', 'Tags']
data = df[col]
data.columns =['Text', 'Tags']
df['id'] = df['Tags'].factorize()[0]
product_id_data = df[['Tags', 'id']].drop_duplicates().sort_values('id')
product_to_id = dict(product_id_data.values)
id_to_product = dict(product_id_data[['id', 'Tags']].values)
tfidf = TfidfVectorizer(sublinear_tf=True, 
                        min_df=5, 
                        norm='l2', 
                        encoding='latin-1', 
                        ngram_range=(1, 2),
                        stop_words='english')
features = tfidf.fit_transform(df.Text).toarray()
labels = df.id
X_train, X_test, y_train, y_test = train_test_split(df.Text, df.Tags, random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),#Linear Support Vector Classification.
    MultinomialNB(),#Naive Bayes classifier for multinomial models
    LogisticRegression(random_state=0),
]
CV = 10
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
print(cv_df.groupby('model_name').accuracy.mean())

I get an error when my code reaches this line:

accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)

This is the error:

ValueError: n_splits=10 cannot be greater than the number of members in each class.

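You are using id as the training label, and in your example it looks like a unique entry per row, so this makes no sense: you would have as many classes as observations.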
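A quick way to see whether the labels can support a given number of folds is to count the members of each class. The sketch below uses hypothetical labels (12 rows, three classes of 4 members each, not the asker's data) and placeholder features. It calls `StratifiedKFold` directly, which is what `cross_val_score` uses for a classifier when `cv` is an integer, and reproduces the error message from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 12 rows, three classes with 4 members each.
labels = pd.Series(["blue", "green", "orange"] * 4)

# Count how many rows belong to each class.
class_counts = labels.value_counts()
print(class_counts)

# Placeholder features; only the labels decide how stratified folds are built.
X = np.zeros((len(labels), 1))

error_message = None
try:
    # Every class has fewer than 10 members, so 10 stratified folds are impossible.
    list(StratifiedKFold(n_splits=10).split(X, labels))
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the smallest count printed by `value_counts()` is below the intended number of folds, either collect more data per class or lower `cv`.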

Most likely you want to use Tags instead. Here is an example:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
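                   'Text':['Tears made her vision'   # no comma here: adjacent literals are joined into one string,
                           'blur again, She looked', # so the list has 10 entries, matching 'id' and 'Tags'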
                           'away, outside',
                           'at the blur of snow',
                           'as he continued.',
                           'Mr. Green, you are',
                           'wanted on the phone',
                           'I prefer oranges',
                           'to apples',
                           'Tom drank his',
                           'orange juice'],
                   'Tags':['blue','blue','green','orange','green','orange','blue','green','orange','blue']
                  })

Run the code with CV=3:

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df.Text).toarray()
labels = df.Tags

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),#Linear Support Vector Classification.
    MultinomialNB(),#Naive Bayes classifier for multinomial models
    LogisticRegression(random_state=0),
]
CV = 3
#cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
print(cv_df.groupby('model_name').accuracy.mean())

This example does not have enough data to run CV=10, but as long as every class has at least 10 members, you can run CV=10. The code above gives this output:

model_name
LinearSVC                 0.222222
LogisticRegression        0.222222
MultinomialNB             0.388889
RandomForestClassifier    0.388889
Name: accuracy, dtype: float64
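Rather than hard-coding the number of folds, one option is to derive it from the smallest class. The following is a minimal sketch under assumptions: the toy rows, the `desired_splits` name, and the rule of capping the fold count at the smallest class size are illustrative choices, not part of the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Toy data in the spirit of the example above (10 rows, classes of size 4, 3, 3).
df = pd.DataFrame({
    "Text": [
        "Tears made her vision blur again",
        "She looked away, outside",
        "at the blur of snow",
        "as he continued.",
        "Mr. Green, you are",
        "wanted on the phone",
        "I prefer oranges",
        "to apples",
        "Tom drank his",
        "orange juice",
    ],
    "Tags": ["blue", "blue", "green", "orange", "green",
             "orange", "blue", "green", "orange", "blue"],
})

features = TfidfVectorizer().fit_transform(df["Text"])
labels = df["Tags"]

# Cap the requested fold count at the size of the smallest class.
desired_splits = 10
smallest_class = int(labels.value_counts().min())
n_splits = min(desired_splits, smallest_class)
if n_splits < 2:
    raise ValueError("At least one class has a single member; cross-validation needs 2 or more per class.")

cv = StratifiedKFold(n_splits=n_splits)
accuracies = cross_val_score(MultinomialNB(), features, labels,
                             scoring="accuracy", cv=cv)
print(f"n_splits={n_splits}, fold accuracies={accuracies}")
```

With this toy data the smallest class has 3 members, so the sketch falls back to 3 folds. On a dataset where every class has at least 10 members it would use the requested 10.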

