The answers to asynchronous slower than synchronous did not cover the scenario I was working with, hence this question.
I'm using Python 3.6.0 on Windows 10 to read 11 identical JSON files, named k80.json to k90.json, each 18.1 MB.
First, I tried synchronous, sequential reading of all 11 files. It took 5.07 s to complete.
from json import load
from os.path import join
from time import time

def read_config(fname):
    with open(fname) as json_fp:
        json_data = load(json_fp)
    return len(json_data)

if __name__ == '__main__':
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]
    print('Starting sequential run.')
    start_time1 = time()
    for fname in in_files:
        print(f'Reading file: {fname}')
        print(f'The JSON file size is {read_config(fname)}')
    read_duration1 = round(time() - start_time1, 2)
    print('Ending sequential run.')
    print(f'Synchronous reading took {read_duration1}s')
    print('\n' * 3)
Result
Starting sequential run.
Reading file: C:\Users\userA\Documents\k80.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k81.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k82.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k83.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k84.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k85.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k86.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k87.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k88.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k89.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k90.json
The JSON file size is 5
Ending sequential run.
Synchronous reading took 5.07s
Next, I tried running this with a ThreadPoolExecutor using the map function call and 12 threads. That took 5.69 s.
from concurrent.futures import ThreadPoolExecutor
from json import load
from os.path import join
from time import time

def read_config(fname):
    with open(fname) as json_fp:
        json_data = load(json_fp)
    return len(json_data)

if __name__ == '__main__':
    NUM_THREADS = 12
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]
    th_pool = ThreadPoolExecutor(max_workers=NUM_THREADS)
    print(f'Starting mapped pre-emptive threaded pool run with {NUM_THREADS} threads.')
    start_time2 = time()
    with th_pool:
        map_iter = th_pool.map(read_config, in_files, timeout=10)
    read_duration2 = round(time() - start_time2, 2)
    for map_res in map_iter:
        print(f'The JSON file size is {map_res}')
    print('Ending mapped pre-emptive threaded pool run.')
    print(f'Mapped asynchronous pre-emptive threaded pool reading took {read_duration2}s')
    print('\n' * 3)
Result
Starting mapped pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending mapped pre-emptive threaded pool run.
Mapped asynchronous pre-emptive threaded pool reading took 5.69s
Finally, I tried running this with a ThreadPoolExecutor using the submit function call and 12 threads. That took 5.73 s.
from concurrent.futures import ThreadPoolExecutor
from json import load
from os.path import join
from time import time

def read_config(fname):
    with open(fname) as json_fp:
        json_data = load(json_fp)
    return len(json_data)

if __name__ == '__main__':
    NUM_THREADS = 12
    in_files = [join('C:\\', 'Users', 'userA', 'Documents', f'k{idx}.json') for idx in range(80, 91)]
    th_pool = ThreadPoolExecutor(max_workers=NUM_THREADS)
    results = []
    print(f'Starting submitted pre-emptive threaded pool run with {NUM_THREADS} threads.')
    start_time3 = time()
    with th_pool:
        for fname in in_files:
            results.append(th_pool.submit(read_config, fname))
    read_duration3 = round(time() - start_time3, 2)
    for result in results:
        print(f'The JSON file size is {result.result(timeout=10)}')
    print('Ending submitted pre-emptive threaded pool run.')
    print(f'Submitted asynchronous pre-emptive threaded pool reading took {read_duration3}s')
Result
Starting submitted pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending submitted pre-emptive threaded pool run.
Submitted asynchronous pre-emptive threaded pool reading took 5.73s
Questions
Why does synchronous reading perform faster than threading when reading reasonably large JSON files like these? I was expecting threading to be faster given the file size and number of files being read.
Are JSON files of much larger sizes than these required for threading to perform better than synchronous reading? If not, what are the other factors to consider?
I thank you in advance for your time and help.
Post-Script
Thanks to the answers below, I changed the read_config method slightly to introduce a 3 s sleep delay (simulating an I/O-wait operation), and now the threaded versions really shine (38.81 s vs 9.36 s and 9.39 s).
from time import sleep

def read_config(fname):
    with open(fname) as json_fp:
        json_data = load(json_fp)
    sleep(3)  # Simulate an activity that waits on I/O.
    return len(json_data)
Results
Starting sequential run.
Reading file: C:\Users\userA\Documents\k80.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k81.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k82.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k83.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k84.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k85.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k86.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k87.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k88.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k89.json
The JSON file size is 5
Reading file: C:\Users\userA\Documents\k90.json
The JSON file size is 5
Ending sequential run.
Synchronous reading took 38.81s
Starting mapped pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending mapped pre-emptive threaded pool run.
Mapped asynchronous pre-emptive threaded pool reading took 9.36s
Starting submitted pre-emptive threaded pool run with 12 threads.
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
The JSON file size is 5
Ending submitted pre-emptive threaded pool run.
Submitted asynchronous pre-emptive threaded pool reading took 9.39s
I'm no expert, but in general, for threading to speed things up, your program needs to be waiting on I/O. Threading does not give you access to additional CPU cores; it just allows operations to run concurrently, sharing the same CPU time and the same Python interpreter. (If you want to use more CPU cores, look at ProcessPoolExecutor.)
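To illustrate the distinction, here is a minimal sketch (not from the original post) of pushing a purely CPU-bound function through a ProcessPoolExecutor; unlike threads, each worker is a separate interpreter process with its own GIL, so it can run on its own core:

```python
from concurrent.futures import ProcessPoolExecutor

def count_up(n):
    # CPU-bound work: no I/O at all, so threads would gain nothing here,
    # but separate processes can run on separate cores.
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == '__main__':
    # Each task is shipped to a worker process; results come back pickled.
    with ProcessPoolExecutor(max_workers=4) as pool:
        totals = list(pool.map(count_up, [10**6] * 4))
    print(totals)
```

The trade-off is that arguments and results must be pickled between processes, so ProcessPoolExecutor only pays off when the computation per task outweighs that serialization cost.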
For example, if you were reading from multiple remote databases rather than local files, your program would spend a lot of time waiting on I/O while using almost no local resources. In that case threading could help, because you could do the waiting in parallel, or process one item while waiting for another. However, since all your data comes from local files, you are probably already saturating your local disk I/O; you can't read multiple files at the same time (or at least not any faster than you can read them sequentially). Your machine still has to complete the same work with the same resources, and it has no real 'downtime' in either variant, which is why both take roughly the same time.
In this case your task is processor-bound, not I/O-bound: your CPU can loop through the data at a fixed rate. If you chop the task up across multiple threads, it still takes the same amount of time, because your processor can only work on one chunk at a time (switching between threads). You only get a speedup when the task at hand is I/O-bound, say, fetching data from a website that takes a long time to respond (even a 50-microsecond lag is a long time compared to the rate at which your CPU could process the data if it already had it).
See my SO answer to another question here for a bit more explanation.
In an ideal world, multithreading your case would take the same amount of time as performing the computation serially. In practice, however, it takes resources and time to break the work apart, hand each task to a thread, wait for each result, and finally stitch the results back together into the final output. All of that adds up to the roughly 0.6 seconds of extra run time you see in your parallelized output.
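A rough way to see that overhead, using nothing beyond the standard library, is to push trivial no-op tasks through a pool and compare against a plain loop; with no I/O to hide, the pool's dispatch and bookkeeping is pure cost:

```python
from concurrent.futures import ThreadPoolExecutor
from time import time

def no_op(x):
    return x  # Trivial task: any extra measured time is pure pool overhead.

start = time()
serial = [no_op(i) for i in range(1000)]
serial_time = time() - start

start = time()
with ThreadPoolExecutor(max_workers=12) as pool:
    threaded = list(pool.map(no_op, range(1000)))
threaded_time = time() - start

# The threaded version is typically slower here: submitting tasks,
# scheduling threads, and collecting futures all cost time the
# serial loop avoids entirely.
print(f'serial: {serial_time:.4f}s, threaded: {threaded_time:.4f}s')
```

Exact numbers vary by machine, but the results are identical while the pool version carries the extra coordination cost.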
Larger JSON files would not speed up multi-threading. For that to happen, your JSON files would have to be, say, hosted on a website. Let's break down what would happen in both the serial and parallel cases:
Serial Case
Since the data transmission speed of the website is slow relative to the data-crunching speed of your CPU, your CPU would sit idle waiting until each file was available:
request JSON_file1.json -> wait -> process JSON_file1.json
request JSON_file2.json -> wait -> process JSON_file2.json
...
Parallel Case
Since the data transmission speed of the website is slow relative to the data-crunching speed of your CPU, the CPU would again sit idle while waiting. But if you distribute the tasks across threads, you can initiate the requests for JSON_file1.json, JSON_file2.json, JSON_file3.json, JSON_file4.json, and so on almost simultaneously. Then, as each JSON_file is received, the CPU processes each file.
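This parallel case can be simulated without a real website by replacing the network wait with a sleep (a stand-in assumption, since the original files are local; the file names below just mirror the question's k80-k90 naming):

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep, time

def fetch_json(name):
    # Stand-in for a remote request: the sleep simulates network latency,
    # during which the thread uses no CPU and releases the GIL.
    sleep(0.2)
    return f'{name}: 5'

names = [f'k{i}.json' for i in range(80, 91)]

start = time()
with ThreadPoolExecutor(max_workers=12) as pool:
    results = list(pool.map(fetch_json, names))
elapsed = time() - start

# All 11 simulated requests wait concurrently, so the total time is close
# to a single request's latency (~0.2 s) rather than 11 * 0.2 s serially.
print(f'{len(results)} files in {elapsed:.2f}s')
```

This is the same effect the post-script above demonstrates with its 3 s sleep: the speedup comes entirely from overlapping the waiting, not from faster processing.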