Randomly splitting training and testing data

Question

I have around 3000 objects, each with an associated count. I want to randomly divide these objects into training and testing data with a 70% training / 30% testing split. However, the split should be based on the counts associated with the objects, not on the number of objects.

For example, assume my dataset contains 5 objects:

Obj 1 => 200
Obj 2 => 30
Obj 3 => 40
Obj 4 => 20
Obj 5 => 110

If I split them with a nearly 70%-30% ratio, my training set should be

Obj 2 => 30
Obj 3 => 40
Obj 4 => 20
Obj 5 => 110

and my testing set would be

Obj 1 => 200

If I split them again, I should get a different training and testing set that is again close to the 70-30 ratio. I understand the split above is not an exact 70-30 split, but as long as it is close, that's acceptable.

Are there any predefined methods/packages to do this in Python?

Answer 1

Assuming I understand your question correctly, my suggestion would be this:

from random import shuffle

# Total "count" across all the objects, O(n)
total = sum(obj.count for obj in obj_list)
shuffle(obj_list)  # shuffles in place
running_sum = 0
i = 0
# Take objects from the front until they hold about 30% of the total count
while running_sum < total * 0.3:
    running_sum += obj_list[i].count
    i += 1
testing_data = obj_list[:i]
training_data = obj_list[i:]

This entire operation is O(n), and you're not going to get better time complexity than that. There are certainly ways to condense the loop into one-liners, but I don't know of any builtin that does what you're asking in a single function call, especially since you want it to be "random" in the sense of giving a different training/testing set each time you split (as I understand the question).
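A minimal, self-contained sketch of the approach above, wrapped in a function so it can be called repeatedly. The `Obj` namedtuple, the `split_by_count` name, and the `test_fraction` and `seed` parameters are illustrative assumptions; they are not part of the original answer or of any library.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for the question's objects: a name plus its count.
Obj = namedtuple("Obj", ["name", "count"])


def split_by_count(objects, test_fraction=0.3, seed=None):
    """Shuffle, then move objects into the test set until it holds at
    least test_fraction of the total count. Returns (training, testing)."""
    rng = random.Random(seed)   # seed=None gives a different split per call
    shuffled = list(objects)    # copy so the caller's list is left untouched
    rng.shuffle(shuffled)

    total = sum(o.count for o in shuffled)
    target = total * test_fraction

    running_sum = 0
    i = 0
    while i < len(shuffled) and running_sum < target:
        running_sum += shuffled[i].count
        i += 1
    return shuffled[i:], shuffled[:i]


# The five objects from the question (total count 400).
objects = [Obj("Obj 1", 200), Obj("Obj 2", 30), Obj("Obj 3", 40),
           Obj("Obj 4", 20), Obj("Obj 5", 110)]
training, testing = split_by_count(objects)
test_share = sum(o.count for o in testing) / float(sum(o.count for o in objects))
print("testing:", [o.name for o in testing], "share: %.2f" % test_share)
```

Because whole objects are moved, the test share is at least 30% but can overshoot. With the question's data, a shuffle that puts Obj 1 first yields exactly the 200/200 split shown in the question.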

Answer 2

I do not know if there is a specific function in Python, but assuming there isn't, here is an approach.

Shuffle objects:

 from random import shuffle
 values = [200, 40, 30, 110, 20]
 shuffle(values)  # shuffles in place and returns None

Calculate each value's share of the total:

 total = float(sum(values))
 prob = [v / total for v in values]

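Loop over the shares, accumulating them until the running total passes 0.7:

cumulative = 0.0
index = len(prob)  # fallback; the running total ends at 1.0, so the loop normally sets it
for i in range(len(prob)):
    cumulative += prob[i]
    if cumulative > 0.7:
        index = i  # position of the object that pushed the total past 0.7
        break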

Now the objects before index are the training objects, and the remaining objects are the testing objects.
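The three steps above can be combined into one runnable sketch. It uses `itertools.accumulate` from the standard library for the running total. The variable names and the choice to put the object that crosses the threshold into the testing set are assumptions made for illustration.

```python
import random
from itertools import accumulate

values = [200, 40, 30, 110, 20]   # counts from the question's example
random.shuffle(values)            # in place; returns None

total = sum(values)
# Running share of the total count after each object; the last entry is 1.0
cumulative = [c / total for c in accumulate(values)]

# First position whose running share exceeds 0.7
index = next(i for i, share in enumerate(cumulative) if share > 0.7)

training = values[:index]   # holds at most 70% of the total count
testing = values[index:]    # assumption: the crossing object goes to testing
print(training, testing, sum(training) / total)
```

With counts this coarse, the training share can fall well short of 70% when a large count is the one that crosses the threshold.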

2 answers

solution1
2 2016-07-27 13:53:44

solution2
0 2016-07-27 13:53:59
