
Kernel dies when using multithreading

I am using a Jupyter notebook to count the occurrences of a value across multiple CSV files. I have around 60 CSV files, each about 1 GB in size. To loop through them efficiently, I use multithreading. However, the kernel keeps dying whenever I execute the following code:

import glob

import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

files = glob.glob(path + '/*.csv')

def func(f):
    df = pd.read_csv(f)
    df = df[df['key'] == 1]
    return df['key'].value_counts()

pool = ThreadPool(4)
results = pool.map(func, files)

pool.close()
pool.join()

results

What could be the reason for this? Is there a way to fix this?

There are two issues in your code.

  1. You are actually using multi-threading, not multi-processing, because Pool is imported from multiprocessing.dummy. Change the import to the following if you want real multi-processing:

     from multiprocessing import Pool 

     But, as you mentioned, there is roughly 60 GB of data in total, so your local machine probably cannot handle this even with multiple processes. A minimal multi-processing sketch of the counting job is shown after this list.

  2. I believe you need a powerful cluster for this task rather than pandas on a single machine, so you may want to consider something like Spark:

     df = spark.read.csv(your_file_list, header=True)
     df = df.filter(df['key'] == 1)
     df.head(5)  # you can use df.collect() if the result set is not too large
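For completeness, below is a minimal multi-processing sketch of the same per-file counting job, in case you still want to try it locally before moving to Spark. It is only a sketch under a few assumptions: the directory placeholder path, the helper name count_key, reading only the key column with usecols, and summing the per-file counts at the end are mine, not part of the original question.

    import glob
    from multiprocessing import Pool  # real processes, unlike multiprocessing.dummy

    import pandas as pd

    path = '/path/to/csv/dir'  # hypothetical placeholder; point this at your own directory

    def count_key(f):
        # Read only the 'key' column to keep per-process memory low
        df = pd.read_csv(f, usecols=['key'])
        # Number of rows in this file where key == 1
        return int((df['key'] == 1).sum())

    if __name__ == '__main__':
        files = glob.glob(path + '/*.csv')
        with Pool(4) as pool:  # 4 worker processes
            per_file_counts = pool.map(count_key, files)
        print(sum(per_file_counts))  # total occurrences across all files

Even so, each worker still has to parse a full ~1 GB file, so memory and I/O pressure remain significant; for ~60 GB of data the Spark route above is likely the more robust option. Also note that, depending on the platform, multiprocessing workers started from a Jupyter notebook may have trouble with functions defined in a cell, in which case moving count_key into a separate module usually helps.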
