
Scikit-learn train_test_split inside multiprocessing pool on Linux (armv7l) does not work

I am experiencing some weird behaviour using train_test_split inside a multiprocessing pool when running Python on the Raspberry Pi 3.

I have something like this:

import multiprocessing

from sklearn.model_selection import train_test_split

def evaluate_Classifier(model, Features, Labels, split_ratio):
    X_train, X_val, y_train, y_val = train_test_split(Features, Labels, test_size=split_ratio)
    ...

# w (the model), Current_Features and Current_Labels are defined elsewhere
iterations = 500
pool = multiprocessing.Pool(4)
results = [pool.apply_async(evaluate_Classifier, args=(w, Current_Features, Current_Labels, 0.35)) for i in range(iterations)]
output = [p.get() for p in results]
pool.close()
pool.join()

Now the above code works perfectly on Windows 7 with Python 3.5.6, and indeed every one of the 4 worker processes gets a different train/test split.

However, when I run it on the Raspberry Pi 3 (scikit-learn 0.19.2), the 4 workers split the data in EXACTLY the same way, so all of them produce the exact same result. The next 4 tasks split the data again (differently this time), but still EXACTLY the same way as each other, and so on.

I even tried using train_test_split with random_state=np.random.randint, but it does not help.

Any ideas why this works on Windows but does not seem to parallelise properly on the Raspberry Pi 3?

Many thanks

Instead of setting a random state, you should try shuffling the data before splitting. You can do this by setting the parameter: shuffle=True.

shuffle is on by default, so even setting shuffle=True explicitly makes no difference. Also, I would like to split the data inside the parallelised function if possible.

Actually, after some digging around, I found out it is because of how Windows and Linux start child processes. On Linux, multiprocessing forks the parent process, so every worker in the pool inherits an identical copy of NumPy's global random state; on Windows, each worker is spawned as a fresh interpreter that seeds its own RNG. (This also explains why np.random.randint did not help: in every worker the value is drawn from the same inherited state.) The best solution to the above is to do as follows:

def evaluate_Classifier(model, Features, Labels, split_ratio, i):
    # An explicit, task-specific random_state makes each split independent
    # of the RNG state the forked worker inherited from the parent.
    X_train, X_val, y_train, y_val = train_test_split(Features, Labels, test_size=split_ratio, random_state=i)
    ...

iterations = 500
pool = multiprocessing.Pool(4)
results = [pool.apply_async(evaluate_Classifier, args=(w, Current_Features, Current_Labels, 0.35, i)) for i in range(iterations)]
output = [p.get() for p in results]
pool.close()
pool.join()
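To see why the unseeded version repeats only on Linux, here is a minimal sketch of the effect (my own illustration, not part of the original code; the function draw is made up): under Linux's fork start method, each worker's first draw from NumPy's global RNG comes out identical.

import multiprocessing

import numpy as np

def draw(_):
    # Every forked worker inherits the parent's global RNG state, so its
    # first draw is identical across workers on Linux. Under Windows'
    # spawn start method each worker re-seeds itself, so the values differ.
    return np.random.randint(0, 1000000)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    print(pool.map(draw, range(4)))  # on Linux: typically four identical values
    pool.close()
    pool.join()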

That fix works well. For a bit more randomness between different runs of the code, we can draw the seeds from a random number generator outside of the function instead of using i, as in the sketch below.
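For example, a minimal sketch of that idea, reusing the names from the snippet above (evaluate_Classifier, w, Current_Features, Current_Labels): generate one seed per task in the parent with Python's random module, whose range matches what NumPy's RandomState accepts.

import random

iterations = 500
# One independent seed per task, generated in the parent process,
# so the splits also differ between separate runs of the script.
seeds = [random.randrange(2**32) for _ in range(iterations)]
results = [pool.apply_async(evaluate_Classifier, args=(w, Current_Features, Current_Labels, 0.35, seeds[i])) for i in range(iterations)]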
