I have the following snippet, which iterates over a list of .csv files and uses an insert_csv_data function to read, preprocess, and insert each .csv file's data into a .hyper file (Hyper is Tableau's in-memory data engine technology, designed for fast data ingest and analytical query processing on large or complex data sets). A detailed explanation of the insert_csv_data function can be found here:
for csv in csv_list:
    insert_csv_data(csv)
The issue with the above code is that it inserts one .csv file into the .hyper file at a time, which is quite slow at the moment.
I would like to know if there is a faster or parallel workaround, as I'm using Apache Spark for processing on Databricks. I've done some research and found modules like multiprocessing, joblib, and asyncio that might work for my scenario, but I'm unsure how to implement them correctly. Please advise.
Edit:
Parallel Code:
from joblib import Parallel, delayed

# n_jobs=-1 uses all available cores; n_jobs=1 would run the calls sequentially
element_run = Parallel(n_jobs=-1)(delayed(insert_csv_data)(csv) for csv in csv_list)
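One caveat: a .hyper file generally supports only one writer at a time, so parallel calls into the same file may contend or fail; parallelizing is safest when only the CSV read/preprocess step is fanned out, or each worker writes its own output. As a minimal runnable sketch of the joblib pattern, with a hypothetical stand-in for insert_csv_data:

```python
from joblib import Parallel, delayed

# Hypothetical stand-in for the real insert_csv_data, which reads a .csv
# file and writes its rows into the .hyper file
def insert_csv_data(csv_path):
    return f"inserted {csv_path}"

csv_list = ["a.csv", "b.csv", "c.csv"]

# n_jobs=-1 fans the calls out across all available cores;
# results come back in the same order as csv_list
results = Parallel(n_jobs=-1)(delayed(insert_csv_data)(csv) for csv in csv_list)
print(results)
```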
This does not directly answer the question, but it demonstrates how multiprocessing and multithreading are easily interchangeable using the concurrent.futures module. Note that the two loops achieve exactly the same thing, and that the only difference between the two sections of code is the executor class.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def tfunc(n):
    return n * n

N = 1_000

def main():
    # Thread-based pool: suited to I/O-bound work
    with ThreadPoolExecutor() as executor:
        for future in [executor.submit(tfunc, n) for n in range(N)]:
            future.result()
    # Process-based pool: suited to CPU-bound work; only the executor class changes
    with ProcessPoolExecutor() as executor:
        for future in [executor.submit(tfunc, n) for n in range(N)]:
            future.result()

if __name__ == '__main__':
    main()
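Applied back to the question, the same executor pattern could fan the per-CSV work out across threads. This is only a sketch under the assumption that the per-CSV work is safe to run concurrently; insert_csv_data below is a hypothetical stand-in for the asker's real function:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the asker's insert_csv_data; the real
# function reads a .csv file and writes its rows into a .hyper file
def insert_csv_data(csv_path):
    return f"inserted {csv_path}"

csv_list = ["a.csv", "b.csv", "c.csv"]

def main():
    # Threads fit I/O-bound work such as reading files; swapping in
    # ProcessPoolExecutor would target CPU-bound preprocessing instead.
    # executor.map preserves the input order in its results.
    with ThreadPoolExecutor() as executor:
        return list(executor.map(insert_csv_data, csv_list))

if __name__ == "__main__":
    print(main())
```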