
How can I ensure that the users and items appear in both the train and test data sets with train_test_split in sklearn?

I have a data set containing user ID, item ID, and rating, as shown below:

user ID     item ID    rating
 1233        1011       4
 1220        0999       3
 2011        0702       1
 ...

When I split it into train and test sets:

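# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)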

Are the users in the test set guaranteed to have already appeared in the train set, and likewise the items? If not, how can I make sure they do? I can't find the answer in the documentation. Could you please tell me?

If you want to ensure that your training and test partitions don't contain the same pairings of user and item, you could replace each unique (user, item) combination with an integer label, then pass these labels to GroupKFold (named LabelKFold in scikit-learn releases before 0.18). To assign an integer label to each unique pairing you could use this trick:

import numpy as np
import pandas as pd
# GroupKFold replaced LabelKFold in scikit-learn 0.18
from sklearn.model_selection import GroupKFold

df = pd.DataFrame({'users':[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings':[2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

# View each (user, item) row as one opaque value so that np.unique
# assigns a single integer label to every unique pairing.
users_items = df[['users', 'items']].values
d = np.dtype((np.void, users_items.dtype.itemsize * users_items.shape[1]))
_, uidx = np.unique(np.ascontiguousarray(users_items).view(d), return_inverse=True)
uidx = uidx.ravel()  # make sure there is one flat label per row

for train, test in GroupKFold(n_splits=3).split(df, groups=uidx):

    # train your classifier using df.loc[train, ['users', 'items']] and
    # df.loc[train, 'ratings']...

    # cross-validate on df.loc[test, ['users', 'items']] and
    # df.loc[test, 'ratings']...
    pass
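
As a rough check of what grouping on (user, item) labels buys you, the sketch below counts how many pairings are shared between train and test in each fold. It assumes a scikit-learn version that provides GroupKFold in sklearn.model_selection, and it swaps the np.void view for pandas' groupby(...).ngroup() to build the labels; the names pair_labels and shared_pairs_per_fold are made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

df = pd.DataFrame({'users': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings': [2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

# One integer label per unique (user, item) pairing (hypothetical helper name).
pair_labels = df.groupby(['users', 'items']).ngroup().to_numpy()

shared_pairs_per_fold = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(df, groups=pair_labels):
    train_pairs = set(pair_labels[train_idx])
    test_pairs = set(pair_labels[test_idx])
    # Number of (user, item) pairings present on both sides of this fold.
    shared_pairs_per_fold.append(len(train_pairs & test_pairs))

print(shared_pairs_per_fold)
```

Every entry should be zero, since a group is never split across the train and test side of a fold. Note that this is the opposite of guaranteeing that users and items show up on both sides.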

I'm still having a hard time understanding your question. If you want to guarantee that your training and test sets do contain examples of the same user, then you could use StratifiedKFold:

from sklearn.model_selection import StratifiedKFold

for train, test in StratifiedKFold(n_splits=3).split(df, df['users']):
    pass  # ...
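
To see the stratified variant in action, here is a small sketch that checks, fold by fold, whether every user occurs in both the train and the test indices. It assumes the sklearn.model_selection API and the same toy frame as above; all_users_in_both is a name invented for this example. Stratifying on users says nothing about items, so those would need a separate check.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.DataFrame({'users': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings': [2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

all_users = set(df['users'])
all_users_in_both = []
# Stratify on the user column so each fold keeps every user's share of rows.
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(df, df['users']):
    train_users = set(df['users'].iloc[train_idx])
    test_users = set(df['users'].iloc[test_idx])
    all_users_in_both.append(train_users == all_users and test_users == all_users)

print(all_users_in_both)
```

This only works when each user has at least n_splits ratings; users with fewer rows cannot be spread over every fold.
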

import numpy as np

def train_test_split(self, ratings, train_rate=0.8):
    """
    Split ratings into a training set and a test set.

    For each user, about (1 - train_rate) of that user's ratings go to the
    test set, preferring movies that are not in the test set yet.
    """
    grps = ratings.groupby('user_id').groups
    test_df_index = list()
    train_df_index = list()

    test_iid = set()  # movie ids already placed in the test set

    for key in grps:
        count = 0
        local_index = list()
        # groupby(...).groups holds index labels, so rows are read with .loc
        grp = np.array(list(grps[key]))

        n_test = int(len(grp) * (1 - train_rate))
        # First pass: take ratings whose movie is new to the test set.
        for i, index in enumerate(grp):
            if count >= n_test:
                break
            movie_id = ratings.loc[index, 'movie_id']
            if movie_id in test_iid:
                continue
            test_iid.add(movie_id)
            test_df_index.append(index)
            local_index.append(i)
            count += 1

        grp = np.delete(grp, local_index)

        # Second pass: fill the remaining quota from the leftover ratings.
        if count < n_test:
            local_index = list()
            for i, index in enumerate(grp):
                if count >= n_test:
                    break
                test_iid.add(ratings.loc[index, 'movie_id'])
                test_df_index.append(index)
                local_index.append(i)
                count += 1

            grp = np.delete(grp, local_index)

        train_df_index.append(grp)

    test_df_index = np.array(test_df_index)
    # The per-user arrays differ in length, so concatenate them directly.
    train_df_index = np.concatenate(train_df_index)

    np.random.shuffle(test_df_index)
    np.random.shuffle(train_df_index)

    return ratings.loc[train_df_index], ratings.loc[test_df_index]

You can use this method to split the data. I've tried to make sure that the training set and the test set contain the same user IDs and movie IDs.
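
Whichever way the split is produced, it may be worth verifying the property directly. The helper below is a minimal sketch, not part of any library: unseen_in_train is a hypothetical name, and the user_id / movie_id column names are assumed to match the method above. It reports which IDs occur in the test set but never in the training set.

```python
import pandas as pd

def unseen_in_train(train, test, columns=('user_id', 'movie_id')):
    """Return, per column, the ids that occur in test but not in train."""
    return {col: set(test[col]) - set(train[col]) for col in columns}

# Tiny made-up frames: user 3 and movie 13 appear only in the test set.
train = pd.DataFrame({'user_id': [1, 1, 2, 2],
                      'movie_id': [10, 11, 10, 12],
                      'rating': [4, 3, 5, 1]})
test = pd.DataFrame({'user_id': [1, 3],
                     'movie_id': [12, 13],
                     'rating': [2, 4]})

missing = unseen_in_train(train, test)
print(missing)
```

An empty set for every column means each user and movie in the test set was also seen during training.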
