
Scikit-learn train_test_split inside multiprocessing pool on Linux (armv7l) does not work

I am experiencing some weird behaviour using train_test_split inside a multiprocessing pool when running Python on the Raspberry Pi 3.

I have something like this:

import multiprocessing

from sklearn.model_selection import train_test_split

def evaluate_Classifier(model, Features, Labels, split_ratio):
    X_train, X_val, y_train, y_val = train_test_split(Features, Labels, test_size=split_ratio)
    ...

# w (the model), Current_Features and Current_Labels are defined elsewhere
iterations = 500
pool = multiprocessing.Pool(4)
results = [pool.apply_async(evaluate_Classifier, args=(w, Current_Features, Current_Labels, 0.35)) for i in range(iterations)]
output = [p.get() for p in results]
pool.close()
pool.join()

Now the above code works perfectly on Windows 7 with Python 3.5.6, and indeed every one of the 4 worker processes gets a different train/test split.

However, when I run it on the Raspberry Pi 3 (scikit-learn 0.19.2), the 4 workers split the data in EXACTLY the same way, so all of them produce the exact same result. The next 4 tasks split the data again (differently this time), but still EXACTLY the same way as each other, and so on.

I even tried using train_test_split with random_state=np.random.randint, but it does not help.

Any ideas why this works on Windows but does not seem to parallelise properly on the Raspberry Pi 3?

Many thanks

Instead of setting a random state, you should try shuffling the data before splitting. You can do this by setting the parameter: shuffle=True.

shuffle is on by default, so even setting shuffle=True explicitly makes no difference. Also, I would like to split the data inside the parallelised function if possible.

Actually, after some digging around, I found out it is because of how Windows and Linux start child processes. On Linux, multiprocessing forks the parent process, so every worker in the pool inherits an identical copy of NumPy's global random state; on Windows, each worker is spawned as a fresh interpreter that seeds its own RNG. (This also explains why np.random.randint did not help: in every worker the value is drawn from the same inherited state.) The best solution to the above is to do as follows:

def evaluate_Classifier(model, Features, Labels, split_ratio, i):
    # An explicit, task-specific random_state makes each split independent
    # of the RNG state the forked worker inherited from the parent.
    X_train, X_val, y_train, y_val = train_test_split(Features, Labels, test_size=split_ratio, random_state=i)
    ...

iterations = 500
pool = multiprocessing.Pool(4)
results = [pool.apply_async(evaluate_Classifier, args=(w, Current_Features, Current_Labels, 0.35, i)) for i in range(iterations)]
output = [p.get() for p in results]
pool.close()
pool.join()
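To see why the unseeded version repeats only on Linux, here is a minimal sketch of the effect (my own illustration, not part of the original code; the function draw is made up): under Linux's fork start method, each worker's first draw from NumPy's global RNG comes out identical.

import multiprocessing

import numpy as np

def draw(_):
    # Every forked worker inherits the parent's global RNG state, so its
    # first draw is identical across workers on Linux. Under Windows'
    # spawn start method each worker re-seeds itself, so the values differ.
    return np.random.randint(0, 1000000)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    print(pool.map(draw, range(4)))  # on Linux: typically four identical values
    pool.close()
    pool.join()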

That fix works well. For a bit more randomness between different runs of the code, we can draw the seeds from a random number generator outside of the function instead of using i, as in the sketch below.
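For example, a minimal sketch of that idea, reusing the names from the snippet above (evaluate_Classifier, w, Current_Features, Current_Labels): generate one seed per task in the parent with Python's random module, whose range matches what NumPy's RandomState accepts.

import random

iterations = 500
# One independent seed per task, generated in the parent process,
# so the splits also differ between separate runs of the script.
seeds = [random.randrange(2**32) for _ in range(iterations)]
results = [pool.apply_async(evaluate_Classifier, args=(w, Current_Features, Current_Labels, 0.35, seeds[i])) for i in range(iterations)]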
