
Scikit-learn train_test_split inside multiprocessing pool on Linux (armv7l) does not work

I am experiencing some weird behaviour using train_test_split inside a multiprocessing pool when running Python on the Raspberry Pi 3.

I have something like this:

from sklearn.model_selection import train_test_split
import multiprocessing

def evaluate_Classifier(model, Features, Labels, split_ratio):
    X_train, X_val, y_train, y_val = train_test_split(Features, Labels, test_size=split_ratio)
    ...


iterations = 500
pool = multiprocessing.Pool(4)
results = [pool.apply_async(evaluate_Classifier, args=(w, Current_Features, Current_Labels, 0.35)) for i in range(iterations)]
output = [p.get() for p in results]
pool.close()
pool.join()

Now the above code works perfectly on Windows 7 with Python 3.5.6, and indeed every single one of the 4 threads will have a different train/test split.

However, when I run it on the Raspberry Pi 3 (scikit-learn 0.19.2), it seems that the 4 threads split the data in EXACTLY the same way and so all the threads produce the exact same result. The next 4 threads will split the data again (differently this time), but still EXACTLY the same way between them, and so on...

I even tried using train_test_split with random_state=np.random.randint, but it does not help.

Any ideas why this works on Windows but on the Raspberry Pi 3 it doesn't seem to parallelise properly?

Many thanks

Instead of setting a random state, you should try shuffling the data before splitting. You can do this by setting the parameter shuffle=True.

shuffle is on by default, so even with shuffle=True it does not make a difference. Also, I would like to split the data inside the parallelized function if possible.

Actually, after some digging around I found out it is because of how Windows and Linux handle the threads and resources of child processes. The best solution to the above is to do the following:

from sklearn.model_selection import train_test_split
import multiprocessing

def evaluate_Classifier(model, Features, Labels, split_ratio, i):
    # Passing the task index as random_state gives every worker its own split
    X_train, X_val, y_train, y_val = train_test_split(Features, Labels, test_size=split_ratio, random_state=i)
    ...


iterations = 500
pool = multiprocessing.Pool(4)
results = [pool.apply_async(evaluate_Classifier, args=(w, Current_Features, Current_Labels, 0.35, i)) for i in range(iterations)]
output = [p.get() for p in results]
pool.close()
pool.join()

That will work well, and for a bit more randomness between different runs of the code we can use some random number generator outside of the function instead of i.
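A minimal sketch of that idea, assuming the same evaluate_Classifier, w, Current_Features and Current_Labels as above; drawing the seeds with numpy in the parent process (and the particular seed range) is an assumption, not part of the original answer:

import multiprocessing
import numpy as np

iterations = 500
# Draw one independent seed per task in the parent process, so each
# child receives a different (and run-to-run varying) random_state.
seeds = np.random.randint(0, 2**31 - 1, size=iterations)

pool = multiprocessing.Pool(4)
results = [pool.apply_async(evaluate_Classifier, args=(w, Current_Features, Current_Labels, 0.35, seeds[i])) for i in range(iterations)]
output = [p.get() for p in results]
pool.close()
pool.join()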
