scikit-learn error: The least populated class in y has only 1 member

Question

I'm trying to split my dataset into a training set and a test set using the train_test_split function from scikit-learn, but I'm getting this error:

In [1]: y.iloc[:,0].value_counts()
Out[1]: 
M2    38
M1    35
M4    29
M5    15
M0    15
M3    15

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]: 
Traceback (most recent call last):
  File "run_ok.py", line 48, in <module>
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

However, all classes have at least 15 samples. Why am I getting this error?

X is a pandas DataFrame that holds the data points, and y is a pandas DataFrame with a single column containing the target variable.

I cannot post the original data because it's proprietary, but it should be fairly reproducible: create a random pandas DataFrame (X) with 1k rows x 500 columns and a random pandas DataFrame (y) with the same number of rows (1k) as X, holding the target variable (a categorical label) for each row. The y DataFrame should contain several different categorical labels (e.g. 'class1', 'class2', ...), and each label should occur at least 15 times.

Answer 1

Remove stratify=y when splitting the data into train and test sets:

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)

Answer 2

The problem was that train_test_split takes two arrays as input, but my y was a one-column matrix. If I pass only the first column of y, it works.

xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:,0], test_size=1/3,
  random_state=85, stratify=y.iloc[:,0])
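
The fix described in this answer can be sketched on synthetic data. The frame sizes, the random features, and the column name "target" below are made up for illustration and are not the asker's data. The sketch selects the single column of y so that the labels become a 1-D Series, checks the class counts, and then stratifies on that Series.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the question's data (hypothetical names and sizes)
X = pd.DataFrame(rng.normal(size=(90, 4)))
y = pd.DataFrame({"target": np.repeat(["M0", "M1", "M2"], 30)})

# Select the single column so the labels are a 1-D Series, not an (n, 1) frame
labels = y.iloc[:, 0]
print(labels.value_counts())  # inspect the class counts before stratifying

xtrain, xtest, ytrain, ytest = train_test_split(
    X, labels, test_size=1/3, random_state=85, stratify=labels
)
```

Printing value_counts() on the exact object passed as stratify is a cheap way to confirm that scikit-learn sees the classes you expect.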

Answer 3

The main point is that, with stratified CV, you get this warning whenever the requested number of splits cannot give every CV split the same class ratios as the full data. E.g. with 5 splits and only 2 samples of one class, 2 CV sets get 1 sample of that class each and 3 CV sets get 0, so the ratio for that class is not the same in all CV sets. The real problem arises only when some set ends up with 0 samples of a class, so if every class has at least as many samples as there are CV splits (5 in this case), the warning won't appear.

See https://stackoverflow.com/a/48314533/2340939.
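
One way to act on this, shown here as a minimal sketch rather than a recommended recipe, is to cap the number of folds at the size of the smallest class. The toy labels and the cap of 5 are assumptions chosen to mirror the example in this answer. A class with a single member still cannot be stratified, since the fold count is never allowed to drop below 2.

```python
import warnings
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: class "b" has only 2 members
y = np.array(["a"] * 10 + ["b"] * 2)
X = np.arange(len(y)).reshape(-1, 1)

min_count = min(Counter(y).values())

# Cap the number of folds at the size of the smallest class (at least 2 folds)
n_splits = max(2, min(5, min_count))

cv = StratifiedKFold(n_splits=n_splits)
with warnings.catch_warnings():
    warnings.simplefilter("error")  # turn any warning into an exception
    folds = list(cv.split(X, y))

# Count the members of the rare class that land in each test fold
rare_per_fold = [int((y[test] == "b").sum()) for _, test in folds]
print(n_splits, rare_per_fold)
```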

Answer 4

Try it this way. It worked for me, and it is also mentioned here:

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=42)

Answer 5

Continuing from user2340939's answer: if you really need your train-test splits to be stratified even though some classes have few rows, you can try the following method. I generally use it myself; it copies all the rows of such classes into both the train and test datasets.

from sklearn.model_selection import train_test_split

def get_min_required_rows(test_size=0.2):
    return 1 / test_size

def make_stratified_splits(df, y_col="label", test_size=0.2):
    """
        for any class with rows less than min_required_rows corresponding to the input test_size,
        all the rows associated with the specific class will have a copy in both the train and test splits.
        
        example: if test_size is 0.2 (20% otherwise),
        min_required_rows = 5 (which is obtained from 1 / test_size i.e., 1 / 0.2)
        where the resulting splits will have 4 train rows (80%), 1 test row (20%)..
    """
    
    id_col = "id"
    temp_col = "same-class-rows"
    
    class_to_counts = df[y_col].value_counts()
    df[temp_col] = df[y_col].apply(lambda y: class_to_counts[y])
    
    min_required_rows = get_min_required_rows(test_size)
    copy_rows = df[df[temp_col] < min_required_rows].copy(deep=True)
    valid_rows = df[df[temp_col] >= min_required_rows].copy(deep=True)
    
    X = valid_rows[id_col].tolist()
    y = valid_rows[y_col].tolist()
    
    # notice, this train_test_split is a stratified split
    X_train, X_test, _, _ = train_test_split(X, y, test_size=test_size, random_state=43, stratify=y)
    
    X_test = X_test + copy_rows[id_col].tolist()
    X_train = X_train + copy_rows[id_col].tolist()
    
    df.drop([temp_col], axis=1, inplace=True)
    
    test_df = df[df[id_col].isin(X_test)].copy(deep=True)
    train_df = df[df[id_col].isin(X_train)].copy(deep=True)
    
    print (f"number of rows in the original dataset: {len(df)}")
    
    test_prop = round(len(test_df) / len(df) * 100, 2)
    train_prop = round(len(train_df) / len(df) * 100, 2)
    print (f"number of rows in the splits: {len(train_df)} ({train_prop}%), {len(test_df)} ({test_prop}%)")
    
    return train_df, test_df

Answer 6

Remove stratify.

stratify=y

should only be used for classification problems, so that the output classes (say 'good', 'bad') are distributed in the same proportions across the train and test data. It is a sampling method from statistics. We should avoid using stratify in regression problems. The code below should work:

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
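
To illustrate what stratification does when it is appropriate, here is a small sketch on made-up, imbalanced labels. The 80/20 class sizes, the test_size of 0.25, and the helper bad_share are assumptions for the example. It compares the share of the 'bad' class in the full data and in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 "good" and 20 "bad"
y = np.array(["good"] * 80 + ["bad"] * 20)
X = np.arange(len(y)).reshape(-1, 1)

_, _, ytrain, ytest = train_test_split(
    X, y, test_size=0.25, random_state=85, stratify=y
)

def bad_share(labels):
    """Fraction of labels equal to 'bad' (hypothetical helper)."""
    return float(np.mean(labels == "bad"))

# With stratify=y the three shares should be (nearly) identical
print(bad_share(y), bad_share(ytrain), bad_share(ytest))
```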

Answer 7

I had this issue because some of the things I was splitting were lists and some were arrays. When I converted the arrays to lists, it worked.

Answer 8

I had the same problem. Some of my classes had only one or two items (mine is a multi-class problem). You can remove or merge the classes that have too few items. That is how I solved my problem.
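
The "remove" option might look like the sketch below. The frame, its column names, and the threshold of 2 are hypothetical. Classes with fewer than 2 rows are dropped before a stratified split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: class "c" has a single row
df = pd.DataFrame({
    "feature": range(11),
    "label": ["a"] * 5 + ["b"] * 5 + ["c"],
})

counts = df["label"].value_counts()
keep = counts[counts >= 2].index  # classes large enough to stratify
filtered = df[df["label"].isin(keep)]

train_df, test_df = train_test_split(
    filtered, test_size=0.2, random_state=0, stratify=filtered["label"]
)
```

Dropping rows discards data, so this is mainly reasonable when the rare classes are noise or not of interest.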

Answer 9

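import pandas as pd
from sklearn.model_selection import train_test_split

all_keys = df['Key'].unique().tolist()

t_parts = []  # pieces of the first (training) split
c_parts = []  # pieces of the second (held-out) split

for key in all_keys:
    print(key)
    subset = df.loc[df['Key'] == key]
    if subset.shape[0] < 2:
        # a single-row class cannot be split, so it goes to the first split only
        t_parts.append(subset)
    else:
        df_t, df_c = train_test_split(subset, test_size=0.2, stratify=subset['Key'])
        t_parts.append(df_t)
        c_parts.append(df_c)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
t_df = pd.concat(t_parts) if t_parts else pd.DataFrame()
c_df = pd.concat(c_parts) if c_parts else pd.DataFrame()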

Answer 10

When you use stratify=y, combine the sparsely populated categories into one. For example, filter the labels with fewer than 50 occurrences and relabel them as a single category such as "others" (or any other name). The least-populated-class error will then be resolved.
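
A minimal sketch of this relabeling with pandas follows. The toy labels, the feature column, and the threshold min_members = 2 are assumptions (the answer suggests 50 for real data). Labels below the threshold are replaced with "others" before the stratified split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels: "c", "d" and "e" each occur only once
labels = pd.Series(["a"] * 6 + ["b"] * 6 + ["c", "d", "e"])
X = pd.DataFrame({"feature": range(len(labels))})

min_members = 2  # illustrative threshold; the answer above uses 50
counts = labels.value_counts()
rare = counts[counts < min_members].index

# Replace every rare label with a single catch-all category
merged = labels.where(~labels.isin(rare), "others")

xtrain, xtest, ytrain, ytest = train_test_split(
    X, merged, test_size=1/3, random_state=85, stratify=merged
)
```

The merged category must itself end up with at least 2 members, otherwise the same error comes back.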

Answer 11

Do you like "functional" programming? Like confusing your co-workers and writing everything in one line of code? Are you the type of person who loves nested ternary operators instead of 2 'if' statements? Are you an Elixir programmer trapped in a Python programmer's body?

If so, the following solution may work for you. It lets you discover, at run time, how many members the least-populated class has, and then adjust your cross-validation value on the fly:

""" Let's say our dataframe is like this, for example:
 
    dogs         weight     size
    ----         ----       ----
    Poodle       14         small
    Maltese      13         small
    Shepherd     45         big
    Retriever    41         big
    Burmese      43         big

The 'least populated class' would be 'small', as it only has 2 members.
If we tried doing more than 2-fold cross validation on this, the results
would be skewed.
"""

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = df[['weight']]  # double brackets keep X two-dimensional, as scikit-learn estimators expect
y = df['size']

# Random forest classifier, to classify dogs into big or small
model = RandomForestClassifier()

# Find the number of members in the least-populated class, THIS IS THE LINE WHERE THE MAGIC HAPPENS :)
leastPopulated = [x for d in set(list(y)) for x in list(y) if x == d].count(min([x for d in set(list(y)) for x in list(y) if x == d], key=[x for d in set(list(y)) for x in list(y) if x == d].count))

# I want to know the F1 score at each fold of cross validation.
# This 'fOne' variable will be a list of the F1 score from each fold
fOne = cross_val_score(model, X, y, cv=leastPopulated, scoring='f1_weighted')

# We print the F1 score here
print(f"Average F1 score during cross-validation: {np.mean(fOne)}")
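
For readers who prefer something less adventurous, the same count can be obtained with pandas' value_counts. This is only a sketch: the Series below is rebuilt by hand from the 'size' column of the example table, and the name least_populated is made up.

```python
import pandas as pd

# The 'size' column from the example table above
y = pd.Series(["small", "small", "big", "big", "big"], name="size")

# Size of the least-populated class
least_populated = int(y.value_counts().min())
print(least_populated)  # → 2
```

Note that cross_val_score needs cv to be at least 2, so a class with a single member still has to be removed or merged first.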

11 answers

Answer 1 7 2020-04-03 23:42:31
Answer 2 5 2017-04-03 09:36:11
Answer 3 2 2020-06-25 20:02:57
Answer 4 1 2020-02-14 23:53:57
Answer 5 0 2021-05-11 12:19:03
Answer 6 0 2021-08-16 19:04:45
Answer 7 0 2021-09-06 23:27:44
Answer 8 0 2021-11-23 13:29:00
Answer 9 0 2022-02-10 15:22:04
Answer 10 0 2022-06-20 07:44:05
Answer 11 0 2022-12-15 05:47:57
