I am using Jupyter Notebook to count the occurrences of a value across multiple CSV files. I have around 60 CSV files, each about 1 GB. To loop through them efficiently, I use multithreading. However, the kernel keeps dying whenever I execute the following code:
import glob
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

files = glob.glob(path + '/*.csv')

def func(f):
    df = pd.read_csv(f)
    df = df[df['key'] == 1]
    return df['key'].value_counts()

pool = ThreadPool(4)
results = pool.map(func, files)
pool.close()
pool.join()
results
What could be the reason for this? Is there a way to fix this?
There are two issues in your code.

First, you are actually using multithreading, not multiprocessing: multiprocessing.dummy exposes the same Pool API but is backed by threads, so CPU-bound pandas work like this gains little because of the GIL. Change the import as follows if you want real multiprocessing:
from multiprocessing import Pool
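Also note that pool.map returns a list of per-file Series, one value_counts result per CSV, so you still need to combine them into a single total. A minimal sketch of that combining step, using small in-memory frames to stand in for your 60 files (the frames and their contents here are made-up examples):

```python
import pandas as pd

# Two tiny frames standing in for two of the CSV files.
frames = [pd.DataFrame({'key': [1, 1, 2]}),
          pd.DataFrame({'key': [1, 3]})]

def func(df):
    # Same logic as in the question: keep rows where key == 1, then count.
    df = df[df['key'] == 1]
    return df['key'].value_counts()

# pool.map(func, files) returns the same shape: a list of Series.
results = [func(df) for df in frames]

# Sum counts that share the same index value across all files.
total = pd.concat(results).groupby(level=0).sum()
print(total)  # key 1 appears 3 times across both "files"
```

pd.concat followed by groupby(level=0).sum() aligns the Series by their index, so it still works if some files contain key values others do not.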
Second, as you mentioned, there is roughly 60 GB of data in total. Loading whole files into DataFrames (pandas can need several times the on-disk size in RAM while parsing) is likely more than your local machine can handle, which would explain the dying kernel.
I believe you need a powerful cluster for this task (no more pandas), so you may want to consider something like Spark:
df = spark.read.csv(your_file_list, header=True)
df = df.filter(df.key == 1)
df.head(5)  # you can use df.collect() if the result set is not too large