
Python - Sample one element randomly from a list based on the unique elements of another list

I have two lists containing user_ids and item_ids. I want to randomly sample one item for each user.

For example:

user_ids = [1, 2, 3, 1, 2]
item_ids = [8, 9, 10, 5, 8]

I want to get:

val_user_ids = [1, 2, 3]
val_item_ids = [5, 9, 10]

I know some inefficient ways, such as looping. Is there a more efficient way, or an existing Python function that does this?

To be precise, I want to create a validation set (from the training set) containing one item interaction for each user.

You can gather your data in a dictionary with the user_id as the key and the list of that user's item_ids as the value:

import collections

user_ids = [1, 2, 3, 1, 2]
item_ids = [8, 9, 10, 5, 8]

data = collections.defaultdict(list)
for key, value in zip(user_ids, item_ids):
    data[key].append(value)

The result is defaultdict(<class 'list'>, {1: [8, 5], 2: [9, 8], 3: [10]}).

Now we can loop over the dictionary and get a random item from the list.

import random
result = [(key, random.choice(value)) for key, value in data.items()]

The result is [(1, 8), (2, 9), (3, 10)] (or [(1, 8), (2, 8), (3, 10)] or whatever the randomization will give us).
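If two parallel lists are wanted, as in the question, the list of pairs can be unzipped. A minimal sketch, assuming the same example data; the names val_user_ids and val_item_ids come from the question, and the seed is an assumption added only to make the run repeatable:

```python
import collections
import random

user_ids = [1, 2, 3, 1, 2]
item_ids = [8, 9, 10, 5, 8]

# Group the items by user, as above.
data = collections.defaultdict(list)
for key, value in zip(user_ids, item_ids):
    data[key].append(value)

random.seed(0)  # optional: makes the draw repeatable
result = [(key, random.choice(value)) for key, value in data.items()]

# zip(*result) transposes the list of pairs into two tuples,
# which are then converted to lists.
val_user_ids, val_item_ids = map(list, zip(*result))
```

Because dictionaries keep insertion order (Python 3.7+), val_user_ids follows the order in which each user first appears in user_ids.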


Some more information about defaultdict: this kind of dictionary creates a default value whenever a missing key is accessed. The default is produced by the factory (here list) that is passed as a parameter when creating the defaultdict. With a standard dict we have to handle the creation of the entry ourselves.
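To make the difference concrete, here is a small sketch of how a missing key behaves in each kind of dictionary. The variable names groups, plain and missing_key_raised are only illustrative:

```python
import collections

groups = collections.defaultdict(list)  # list() is called for each missing key
groups["a"].append(1)  # "a" is missing, so an empty list is created first
_ = groups["b"]        # merely reading a missing key also creates an entry

plain = {}
missing_key_raised = False
try:
    plain["a"].append(1)  # a regular dict has no entry to append to
except KeyError:
    missing_key_raised = True
```

Note the side effect on groups["b"]: looking up a key is enough to insert it, which can matter if you later iterate over the dictionary.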

This is how it would be done manually:

user_ids = [1, 2, 3, 1, 2]
item_ids = [8, 9, 10, 5, 8]

data = dict()
for key, value in zip(user_ids, item_ids):
    if key not in data:
        data[key] = []
    data[key].append(value)
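Since the stated goal is a validation set carved out of the training set, it may help to sample positions rather than values, so that the chosen interactions can also be removed from the training data. A sketch under that assumption; the function name split_one_per_user and its seed parameter are hypothetical and not part of the original answer:

```python
import collections
import random


def split_one_per_user(user_ids, item_ids, seed=None):
    """Hold out one randomly chosen interaction per user (hypothetical helper)."""
    rng = random.Random(seed)

    # Map each user to the positions of their interactions.
    positions = collections.defaultdict(list)
    for index, user in enumerate(user_ids):
        positions[user].append(index)

    # Pick one position per user for the validation set.
    held_out = {rng.choice(indices) for indices in positions.values()}

    val = [(user_ids[i], item_ids[i]) for i in sorted(held_out)]
    train = [
        (user_ids[i], item_ids[i])
        for i in range(len(user_ids))
        if i not in held_out
    ]
    return train, val


train_pairs, val_pairs = split_one_per_user([1, 2, 3, 1, 2], [8, 9, 10, 5, 8], seed=0)
```

A user with a single interaction (user 3 here) ends up only in the validation set, so you may want to skip such users depending on how the model is evaluated.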

Could you use NumPy? Example code would be:

import numpy as np

# assumes user_ids and item_ids are lists of equal length
your_list_size = len(user_ids)
idx = list(range(your_list_size))

# size of the random draw, as a fraction of the data
val_size = 0.2
val_n = int(your_list_size * val_size)

# draw positions shared by both lists; replace=False means no position is drawn twice
chosen_idx = np.random.choice(idx, size=val_n, replace=False)

# look up the actual values at the chosen positions
sample_users = np.array(user_ids)[chosen_idx]
sample_items = np.array(item_ids)[chosen_idx]

Or, if the pairing between users and items does not need to be preserved, simply draw from each list independently:

sample_users = np.random.choice(user_ids, size=val_n, replace=False)
sample_items = np.random.choice(item_ids, size=val_n, replace=False)
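The NumPy snippets above draw a random fraction of the interactions, so they do not guarantee exactly one item per user. One possible way to get that with NumPy is to shuffle the positions and then keep the first occurrence of each user. This is a sketch of a different technique from the one in the answer, and the seed is an assumption added only for repeatability:

```python
import numpy as np

user_ids = np.array([1, 2, 3, 1, 2])
item_ids = np.array([8, 9, 10, 5, 8])

rng = np.random.default_rng(0)  # seed chosen only for repeatability

# Shuffle the positions, so the "first occurrence" of each user
# in the shuffled order is a random one of that user's interactions.
perm = rng.permutation(len(user_ids))

# np.unique returns the sorted unique users and, with return_index=True,
# the index of the first occurrence of each in the shuffled array.
val_user_ids, first = np.unique(user_ids[perm], return_index=True)

# Map back through the permutation to get the matching items.
val_item_ids = item_ids[perm[first]]
```

This avoids a Python-level loop over users, which may matter for large interaction logs.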

Assuming the items need to be sampled with replacement, the following code will work:

import random

user_ids = [1,2,3,1,2]
item_ids = [8,9,10,5,8]
val_user_ids = sorted(set(user_ids))
val_item_ids = [random.choice(item_ids) for _ in val_user_ids]

The set built-in returns the unique elements of an iterable such as a list, and sorted then returns them as a sorted list (if you don't need the order, just use list(set(user_ids))). The list comprehension builds a new list of items sampled from item_ids with replacement, usually faster than an equivalent for loop. Note that it draws from the whole item_ids list, not from each user's own items, so a user can be paired with an item they never interacted with. One caveat: the elements of user_ids must be hashable for set to work (numbers are fine, as are strings, frozensets, and tuples, as long as the tuples do not contain unhashable structures such as lists).

If instead you need to sample without replacement, you could use:

import random

user_ids = [1, 2, 3, 1, 2]
item_ids = [8, 9, 10, 5, 8]
val_user_ids = sorted(set(user_ids))
# random.sample draws distinct positions and leaves item_ids unchanged
val_item_ids = random.sample(item_ids, len(val_user_ids))

The same caveat about sets applies (the elements of user_ids must be hashable).


 