How can rpy2 (R code from python) be run in a thread without blocking other threads?

Problem summary

I am trying to run R code from Python using rpy2 in a second thread (not multiprocessing; a plain thread is fine), but it seems to keep blocking my main thread. Can this be avoided?


My original code:

import time
import threading
import rpy2.robjects as robjects

def long_r_function():
    robjects.r('Sys.sleep(10)')    # pretend we have a long-running function in R

r_thread = threading.Thread(target=long_r_function, daemon=True)
start_time = time.time()
r_thread.start()

while r_thread.is_alive():
    print("R running...")    # expect to be able to see this print from main thread, 
    time.sleep(2)            # while R does work in second thread

print("Finished after ", time.time() - start_time, " seconds")

Expected output:

I'd expect to see 5 "R running..." prints, because the main thread shouldn't get blocked while R is sleeping for 10 seconds.

Actual output:

A single print where I expected 5:

R running...
Finished after  12.006645679473877  seconds

Moving the rpy2.robjects import into the thread's target function:

Because the import rpy2.robjects as robjects statement already causes the R environment to be initialized, I figured it might help to move it into the thread's target function long_r_function, to see if that would somehow decouple it from the main thread. This did not work.


Replacing rpy2 code with Python code:

Just as a sanity check, if I completely get rid of the rpy2 code and instead define the thread's target function as

def long_r_function():
    time.sleep(10)

I correctly get the expected output (although the message is now misleading, because no R code is running):

R running...
R running...
R running...
R running...
R running...
Finished after  10.002039194107056  seconds

Update: rpy2 in main thread also blocks second thread

When I switch the threads around with the following code, so that rpy2 runs R code in the main thread while a second thread tries to do plain Python work, the same issue occurs.

def print_and_sleep():
    while True:
        print("Second thread running...")
        time.sleep(2)

py_thread = threading.Thread(target=print_and_sleep, daemon=True)
py_thread.start()
robjects.r('Sys.sleep(10)')

Output: only one or two prints of "Second thread running...", instead of the expected 5. Again, this works fine with a sanity check similar to the one above, where rpy2 is replaced with plain Python code.

So this is not specific to the main thread. It seems that rpy2 simply blocks all Python threads, regardless of where it is running?


Version info:

rpy2.__version__ = 2.8.5
rpy2.rinterface.R_VERSION_BUILD = ('3', '3.2', '', 71607)
sys.version = 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)]

Python uses a lock, the global interpreter lock (GIL), to prevent the concurrent execution of bytecode. When calling into C (which is what happens when R code is executed through rpy2), the GIL is by default not released until that code returns.
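The effect can be reproduced without R. The following is a minimal sketch, not part of rpy2, assuming a POSIX system where ctypes.CDLL(None) exposes libc's usleep() (so it will not run as-is on Windows). ctypes.CDLL releases the GIL around each foreign call, whereas ctypes.PyDLL keeps it held, which mimics a C extension that does not release the GIL. The helper name count_main_thread_ticks is made up for illustration.

```python
import ctypes
import threading
import time

# Assumption: POSIX system where loading None exposes libc's usleep().
libc_releasing_gil = ctypes.CDLL(None)   # ctypes releases the GIL around each call
libc_holding_gil = ctypes.PyDLL(None)    # ctypes keeps the GIL held during each call


def count_main_thread_ticks(lib, sleep_us=500_000, tick_s=0.05):
    """Run lib.usleep() in a worker thread and count main-thread loop iterations."""
    worker = threading.Thread(target=lib.usleep, args=(sleep_us,))
    ticks = 0
    worker.start()
    while worker.is_alive():
        ticks += 1          # only reachable while the main thread can take the GIL
        time.sleep(tick_s)
    worker.join()
    return ticks


if __name__ == "__main__":
    print("ticks while C call releases the GIL:", count_main_thread_ticks(libc_releasing_gil))
    print("ticks while C call holds the GIL:   ", count_main_thread_ticks(libc_holding_gil))
```

The variant that holds the GIL should let the main thread tick far less often, which is the same symptom as the single "R running..." print in the question.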

It is possible to release the GIL from within a C extension, but R is very unfriendly to multithreading (as in "will segfault really quickly"). Because of this, implementing a release of the GIL and handling every way in which concurrent evaluation of R could happen did not feel worth the effort.

If the purpose is to parallelize R code with rpy2, use processes rather than threads (for example using Python's multiprocessing).

lgautier's answer explains why this turns out to be technically difficult. The best way to achieve the intended goal appears to be using multiprocessing instead of threading, even if threads would otherwise be sufficient. The code then looks as follows:

import time
from multiprocessing import Process

def long_r_function():
    import rpy2.robjects as robjects
    robjects.r('Sys.sleep(10)')    # pretend we have a long-running function in R

if __name__ == '__main__':
    r_process = Process(target=long_r_function)
    start_time = time.time()
    r_process.start()

    while r_process.is_alive():
        print("R running...")    # expect to be able to see this print from main process,
        time.sleep(2)            # while R does work in second process

    print("Finished after ", time.time() - start_time, " seconds")

This gives the intended output (often 6 or 7 prints instead of 5, due to the overhead of creating a new process):

R running...
R running...
R running...
R running...
R running...
R running...
Finished after  12.413572311401367  seconds

