I have been experimenting with multithreading using the threading library, creating a separate thread for each of several functions. Each function takes a pandas DataFrame as its argument, runs an SQL query against AWS Redshift, and adds the retrieved data to the DataFrame as a new column. However, sometimes one of the columns is empty when I print the DataFrame after the threads have finished. This seems to be random, and sometimes all of the columns are added without any issues. I thought the purpose of .join() was to prevent this by waiting until each thread had finished before continuing, but that does not seem to be the case.

import pandas as pd
import threading

df = pd.DataFrame()

def redshift_query1(df):
    # run query (details omitted); query_results holds the retrieved data
    df["column_name1"] = query_results

def redshift_query2(df):
    # run query (details omitted); query_results holds the retrieved data
    df["column_name2"] = query_results

def redshift_query3(df):
    # run query (details omitted); query_results holds the retrieved data
    df["column_name3"] = query_results

# All three threads write to the same DataFrame object
t1 = threading.Thread(target=redshift_query1, args=[df])
t2 = threading.Thread(target=redshift_query2, args=[df])
t3 = threading.Thread(target=redshift_query3, args=[df])

t1.start()
t2.start()
t3.start()

t1.join()
t2.join()
t3.join()

print(df)

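pandas is not thread-safe, so writing to the same DataFrame from several threads at once can give inconsistent results, even though .join() does wait for every thread to finish. Single operations on built-in types, such as assigning a key in a dict, are thread-safe in CPython. So you can have each thread store its result in a shared dict and create the DataFrame once all of the threads have finished.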

import pandas as pd
import threading

# Shared dict: each thread writes only its own key
result = {}

def redshift_query1():
    result["column_name1"] = [3]

def redshift_query2():
    result["column_name2"] = [2]

def redshift_query3():
    result["column_name3"] = [1]

t1 = threading.Thread(target=redshift_query1)
t2 = threading.Thread(target=redshift_query2)
t3 = threading.Thread(target=redshift_query3)

t1.start()
t2.start()
t3.start()

t1.join()
t2.join()
t3.join()

# Build the DataFrame in the main thread, after all threads have joined
df = pd.DataFrame(result)
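
The same idea can also be written without any shared mutable state, by letting each worker return its column and assembling the DataFrame in the main thread. This is only a minimal sketch using concurrent.futures from the standard library, not the approach from the answer above. The function fetch_column and the values in jobs are hypothetical stand-ins for the real Redshift queries and their results.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fetch_column(name, values):
    # Hypothetical stand-in for a Redshift query: returns the column name
    # together with the values the query would have produced.
    return name, list(values)


# Hypothetical column names and values, used only for illustration.
jobs = {
    "column_name1": [3],
    "column_name2": [2],
    "column_name3": [1],
}

with ThreadPoolExecutor(max_workers=3) as pool:
    # Each worker returns its result instead of mutating a shared object.
    futures = [pool.submit(fetch_column, name, values)
               for name, values in jobs.items()]
    # result() blocks until that worker is done and re-raises any exception
    # the worker hit, so a failed query is not silently dropped.
    columns = dict(future.result() for future in futures)

# The DataFrame is created once, in the main thread only.
df = pd.DataFrame(columns)
print(df)
```

One possible advantage of this layout is that an exception raised inside a query surfaces in the main thread through result(), whereas an exception in a plain threading.Thread target ends only that thread and can leave a column missing without stopping the program.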