How can I ensure that users and items appear in both the training and test data sets with train_test_split in sklearn?

Question

I have a data set containing user ID, item ID, and rating, as below:

user ID     item ID    rating
 1233        1011       4
 1220        0999       3
 2011        0702       1
 ...

When I split it into train and test sets:

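# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)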

Have the users in the test set already appeared in the train set, and likewise the items? If not, how can I ensure that? I cannot find the answer in the documentation. Could you please tell me?

Answer 1

If you want to ensure that your training and test partitions don't contain the same pairings of user and item, then you could replace each unique (user, item) combination with an integer label and pass these labels to LabelKFold (renamed GroupKFold in sklearn.model_selection since scikit-learn 0.18). To assign integer labels to each unique pairing you could use this trick:

import numpy as np
import pandas as pd
# GroupKFold is the current name of LabelKFold (sklearn.cross_validation
# was removed in scikit-learn 0.20).
from sklearn.model_selection import GroupKFold

df = pd.DataFrame({'users':[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings':[2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

# View each (user, item) row as one opaque block of bytes so that
# np.unique treats the whole row as a single value.
users_items = df[['users', 'items']].values
d = np.dtype((np.void, users_items.dtype.itemsize * users_items.shape[1]))
_, uidx = np.unique(np.ascontiguousarray(users_items).view(d), return_inverse=True)
uidx = uidx.ravel()  # one integer label per row, as a 1-D array

for train, test in GroupKFold(n_splits=3).split(df, groups=uidx):

    # train your classifier using df.iloc[train][['users', 'items']] and
    # df.iloc[train]['ratings']...

    # cross-validate on df.iloc[test][['users', 'items']] and
    # df.iloc[test]['ratings']...
    pass

I'm still having a hard time understanding your question. If you instead want to guarantee that your training and test sets do contain examples of the same users, then you could use StratifiedKFold with the user column as the class labels:

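from sklearn.model_selection import StratifiedKFold

# Stratify on the user column; since scikit-learn 0.18 the data are
# passed to .split() rather than to the constructor.
for train, test in StratifiedKFold(n_splits=3).split(df, df['users']):
    # ...
    pass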
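To see what stratifying on the user column gives you, the following sketch checks, fold by fold, that the set of users is the same on both sides. It assumes the toy df from above and the sklearn.model_selection API; every_user_on_both_sides is a name invented for the example. Note that this only covers users, not items, and that a user with fewer ratings than n_splits cannot appear in every test fold.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.DataFrame({'users':   [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings': [2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

# Treat the user id as the "class" to stratify on.
skf = StratifiedKFold(n_splits=3)

every_user_on_both_sides = True
for train_idx, test_idx in skf.split(df, df['users']):
    train_users = set(df['users'].iloc[train_idx])
    test_users = set(df['users'].iloc[test_idx])
    if train_users != test_users:
        every_user_on_both_sides = False

print(every_user_on_both_sides)
```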

Answer 2

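import numpy as np


def train_test_split(ratings, train_rate=0.8):
    """
    Split ratings into a training set and a test set, user by user.

    Assumes `ratings` has a default RangeIndex and the columns
    'user_id' and 'movie_id'.
    """
    grps = ratings.groupby('user_id').groups
    test_df_index = list()
    train_df_index = list()

    # movie ids that have already been placed in the test set
    test_iid = set()

    for key in grps:
        count = 0
        local_index = list()
        grp = np.array(list(grps[key]))

        # Number of this user's ratings to move to the test set. The small
        # epsilon guards against float error: 5 * (1 - 0.8) is slightly
        # below 1.0 and would otherwise be truncated to 0.
        n_test = int(len(grp) * (1 - train_rate) + 1e-9)

        # First pass: prefer movies that are not in the test set yet.
        for i, index in enumerate(grp):
            if count >= n_test:
                break
            if ratings.iloc[index]['movie_id'] in test_iid:
                continue
            test_iid.add(ratings.iloc[index]['movie_id'])
            test_df_index.append(index)
            local_index.append(i)
            count += 1

        grp = np.delete(grp, local_index)

        # Second pass: fill the remaining quota with any of the user's movies.
        if count < n_test:
            local_index = list()
            for i, index in enumerate(grp):
                if count >= n_test:
                    break
                test_iid.add(ratings.iloc[index]['movie_id'])
                test_df_index.append(index)
                local_index.append(i)
                count += 1

            grp = np.delete(grp, local_index)

        # Whatever is left for this user goes to the training set.
        train_df_index.append(grp)

    # The per-user arrays have different lengths, so concatenate them
    # instead of building a (ragged) array of arrays.
    test_df_index = np.array(test_df_index, dtype=int)
    train_df_index = np.concatenate(train_df_index)

    np.random.shuffle(test_df_index)
    np.random.shuffle(train_df_index)

    return ratings.iloc[train_df_index], ratings.iloc[test_df_index]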

You can use this function to do the split. I have tried to make sure that the training set and the test set contain the same user IDs and movie IDs.
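If the goal is only that every user and every item in the test set has also been seen in the training set, a simpler approach may be enough: split at random, then move any test row with an unseen user or item back into the training set. This is a minimal sketch of that idea, not the method from the answer above; the function name split_with_known_users_items, its parameters, and the toy ratings frame are all made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_known_users_items(df, user_col, item_col, test_size=0.2, seed=0):
    """Random split, then repair: test keeps only rows whose user and item
    both occur in the training set (hypothetical helper)."""
    train, test = train_test_split(df, test_size=test_size, random_state=seed)

    # A test row is "known" if its user and its item both occur in train.
    known = test[user_col].isin(train[user_col]) & test[item_col].isin(train[item_col])

    # Moving the unknown rows into train only adds users/items to train,
    # so the rows that stay in test remain known and one pass is enough.
    train = pd.concat([train, test[~known]])
    test = test[known]
    return train, test


# Toy data: user 4 and movie 13 occur only once.
ratings = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    'movie_id': [10, 11, 12, 10, 11, 12, 10, 11, 12, 13],
    'rating':   [4, 3, 5, 2, 4, 1, 3, 5, 2, 4],
})

train, test = split_with_known_users_items(ratings, 'user_id', 'movie_id', test_size=0.3)
print(len(train), len(test))
```

The trade-off is that the test set can end up smaller than the requested test_size, since users or items that occur only once always end up in the training set.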
