
Randomly splitting training and testing data

I have around 3000 objects, each with an associated count. I want to randomly divide these objects into training and testing data with a 70% training and 30% testing split. However, I want to split them based on the count associated with each object, not on the number of objects.

As an example, assume my dataset contains 5 objects.

Obj 1 => 200
Obj 2 => 30
Obj 3 => 40
Obj 4 => 20
Obj 5 => 110

If I split them with a nearly 70%-30% ratio, my training set should be

Obj 2 => 30
Obj 3 => 40
Obj 4 => 20
Obj 5 => 110

and my testing set would be

Obj 1 => 200

If I split them again, I should get a different training and testing set that is again close to the 70-30 split ratio. I understand the above split does not give me an exact 70-30 split, but as long as it comes close, it's acceptable.

Are there any predefined methods/packages to do this in Python?

Assuming I understand your question correctly, my suggestion would be this:

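from random import shuffle

# Total "count" of all the objects, O(n). Named total so the built-in sum() is not shadowed.
total = sum(obj.count for obj in obj_list)
shuffle(obj_list)  # shuffles the list in place
running_sum = 0
i = 0
# Take objects from the front until they hold about 30% of the total count.
while i < len(obj_list) and running_sum < total * 0.3:
    running_sum += obj_list[i].count
    i += 1
testing_data = obj_list[:i]
training_data = obj_list[i:]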

This entire operation is O(n); you're not going to get any better time complexity than that. There are certainly ways to condense the loop and whatnot into one-liners, but I don't know of any builtins that accomplish what you're asking with a single function, especially when you want it to be "random" in the sense that you get a different training/testing set each time you split (as I understand the question).
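As one possible way to condense the loop, here is a sketch using itertools.accumulate and bisect from the standard library. The function name split_by_count, its parameters, and the representation of objects as (name, count) pairs are assumptions made for illustration; it also assumes every count is positive.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def split_by_count(items, test_fraction=0.3, rng=None):
    """Hypothetical helper: randomly split (name, count) pairs so that the
    test set holds roughly test_fraction of the total count."""
    rng = rng or random.Random()
    shuffled = list(items)            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    # Running totals of the counts in shuffled order.
    running = list(accumulate(count for _, count in shuffled))
    if not running:
        return [], []
    target = running[-1] * test_fraction
    # Index of the first running total that reaches the target, plus one,
    # so the object that crosses the target is included in the test set.
    cut = bisect_left(running, target) + 1
    return shuffled[cut:], shuffled[:cut]   # (training, testing)


objs = [("Obj 1", 200), ("Obj 2", 30), ("Obj 3", 40), ("Obj 4", 20), ("Obj 5", 110)]
train, test = split_by_count(objs)
```

Calling it again without a fixed rng shuffles differently, so the split generally changes from call to call. The test set can overshoot 30% when a large object happens to land at the cut.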

I do not know if there is a specific function for this in Python, but assuming there isn't, here is an approach.

Shuffle objects:

 from random import shuffle
 values = [200, 40, 30, 110, 20]
 shuffle(values)  # shuffles in place and returns None

Calculate each value's share of the total:

 prob = [float(i)/sum(values) for i in values]

Apply a loop:

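running = 0.0
index = len(prob)  # fallback if the 70% mark is never passed
for i, p in enumerate(prob):
    # Stop at the first object that would push the running share past 70%.
    if running + p > 0.7:
        index = i
        break
    running += p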
Now the objects before index are the training objects, and those from index onward are the testing objects.
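Putting these steps together, here is a minimal sketch run on the five example counts from the question. The counts dict and the variable names are assumptions made for illustration.

```python
from random import shuffle

# Example counts from the question; the dict layout is an assumption.
counts = {"Obj 1": 200, "Obj 2": 30, "Obj 3": 40, "Obj 4": 20, "Obj 5": 110}

names = list(counts)
shuffle(names)                       # random order on each run

total = sum(counts.values())
prob = [counts[name] / total for name in names]

running = 0.0
index = len(names)
for i, p in enumerate(prob):
    if running + p > 0.7:            # this object would pass the 70% mark
        index = i
        break
    running += p

training = names[:index]
testing = names[index:]
train_share = sum(counts[n] for n in training) / total
print(training, testing, round(train_share, 3))
```

With this rule the training share never exceeds 70%, so it can fall well short of 70% when a large object lands at the boundary.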
