How to fix ValueError: n_splits=10 cannot be greater than the number of members in each class (sklearn, NLP)
This is my first attempt at multi-class classification and my first time using scikit-learn. I found this code online and tried to adapt it to my data.
My data looks like this:
id  Text                                                             Tags
----------------------------------------------------------------------------
1   Tears made her vision blur again                                 blue
2   She looked away, outside, at the blur of snow as he continued.   blue
3   Mr. Green, you are wanted on the phone                           green
4   I prefer oranges to apples                                       orange
5   Tom drank his orange juice                                       black
Here is my code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
df = pd.read_csv('./dataSet03.csv')
col = ['Text', 'Tags']
data = df[col]
data.columns =['Text', 'Tags']
df['id'] = df['Tags'].factorize()[0]
product_id_data = df[['Tags', 'id']].drop_duplicates().sort_values('id')
product_to_id = dict(product_id_data.values)
id_to_product = dict(product_id_data[['id', 'Tags']].values)
tfidf = TfidfVectorizer(sublinear_tf=True,
                        min_df=5,
                        norm='l2',
                        encoding='latin-1',
                        ngram_range=(1, 2),
                        stop_words='english')
features = tfidf.fit_transform(df.Text).toarray()
labels = df.id
X_train, X_test, y_train, y_test = train_test_split(df.Text, df.Tags, random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),  # Linear Support Vector Classification
    MultinomialNB(),  # Naive Bayes classifier for multinomial models
    LogisticRegression(random_state=0),
]
CV = 10
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
print(cv_df.groupby('model_name').accuracy.mean())
I get this error when my code reaches this line:
accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
This is the error:
ValueError: n_splits=10 cannot be greater than the number of members in each class.
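For reference, the error can be reproduced in isolation. When cv is an integer and the estimator is a classifier, cross_val_score splits the data with StratifiedKFold, which raises this ValueError when every class has fewer members than n_splits. The sketch below is a minimal illustration under that assumption; the toy labels and the try_split helper are hypothetical and not part of the original code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 3 classes with 4 members each, so no class reaches 10.
y = np.array(["blue"] * 4 + ["green"] * 4 + ["orange"] * 4)
X = np.zeros((len(y), 1))  # feature values do not matter for the split check


def try_split(n_splits):
    """Return None if the stratified split works, else the error message."""
    try:
        list(StratifiedKFold(n_splits=n_splits).split(X, y))
        return None
    except ValueError as exc:
        return str(exc)


print(try_split(3))   # 3 folds fit: every class has at least 3 members
print(try_split(10))  # 10 folds exceed the size of every class
```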
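You are using id as the training label. In your example that looks like a unique entry per row, so it makes no sense as a label: you would end up with as many classes as observations.
You most likely want to use Tags instead. Here is an example: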
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   # The first two literals have no comma between them, so Python
                   # joins them into a single string; 'Text' therefore has 10 entries.
                   'Text': ['Tears made her vision'
                            'blur again, She looked',
                            'away, outside',
                            'at the blur of snow',
                            'as he continued.',
                            'Mr. Green, you are',
                            'wanted on the phone',
                            'I prefer oranges',
                            'to apples',
                            'Tom drank his',
                            'orange juice'],
                   'Tags': ['blue', 'blue', 'green', 'orange', 'green',
                            'orange', 'blue', 'green', 'orange', 'blue']
                   })
Running the code with CV=3:
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df.Text).toarray()
labels = df.Tags
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),  # Linear Support Vector Classification
    MultinomialNB(),  # Naive Bayes classifier for multinomial models
    LogisticRegression(random_state=0),
]
CV = 3
#cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
print(cv_df.groupby('model_name').accuracy.mean())
This example does not have enough data to run CV=10, but as long as every class has at least 10 members you can run CV=10. The code above produces this output:
model_name
LinearSVC                 0.222222
LogisticRegression        0.222222
MultinomialNB             0.388889
RandomForestClassifier    0.388889
Name: accuracy, dtype: float64
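One way to avoid guessing a workable fold count is to derive it from the class sizes. The sketch below is only an illustration: safe_n_splits is a hypothetical helper, not a scikit-learn function, and the texts and tags are toy data modelled on the example above. It caps the requested number of folds at the size of the smallest class and passes an explicit StratifiedKFold to cross_val_score.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB


def safe_n_splits(labels, desired=10):
    """Hypothetical helper: cap the fold count at the smallest class size."""
    smallest = int(pd.Series(labels).value_counts().min())
    # StratifiedKFold requires at least 2 folds.
    return max(2, min(desired, smallest))


# Toy data, assumed for illustration only.
texts = ['Tears made her vision blur again', 'She looked away, outside',
         'Mr. Green, you are wanted', 'I prefer oranges to apples',
         'wanted on the phone', 'Tom drank his orange juice',
         'at the blur of snow', 'the green phone',
         'orange juice and apples', 'as he continued to blur']
tags = pd.Series(['blue', 'blue', 'green', 'orange', 'green',
                  'orange', 'blue', 'green', 'orange', 'blue'])

n_splits = safe_n_splits(tags, desired=10)  # smallest class has 3 members
features = TfidfVectorizer().fit_transform(texts)
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
scores = cross_val_score(MultinomialNB(), features, tags,
                         scoring='accuracy', cv=cv)
print(n_splits, len(scores))
```

If the smallest class has a single member, the helper still returns 2 and scikit-learn will warn that the class cannot appear in every fold; collecting more samples for that class or merging rare tags is probably the better fix in that case.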
Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site or the original URL.