I am using Jupyter Notebook to count the occurrences of a value across multiple CSV files. I have around 60 CSV files, each about 1 GB. To loop through them efficiently, I use multithreading. However, the kernel keeps dying whenever I execute the following code:
import glob
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

files = glob.glob(path + '/*.csv')

def func(f):
    df = pd.read_csv(f)
    df = df[df['key'] == 1]
    return df['key'].value_counts()

pool = ThreadPool(4)
results = pool.map(func, files)
pool.close()
pool.join()
results
What could be the reason for this? Is there a way to fix this?
There are two issues in your code.

First, you are actually using multithreading, not multiprocessing: multiprocessing.dummy exposes the same Pool API but is backed by threads, so CPU-bound pandas work like this gains little because of the GIL. Change the import as follows if you want real multiprocessing:
from multiprocessing import Pool
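Also note that pool.map returns a list of per-file Series, one value_counts result per CSV, so you still need to combine them into a single total. A minimal sketch of that combining step, using small in-memory frames to stand in for your 60 files (the frames and their contents here are made-up examples):

```python
import pandas as pd

# Two tiny frames standing in for two of the CSV files.
frames = [pd.DataFrame({'key': [1, 1, 2]}),
          pd.DataFrame({'key': [1, 3]})]

def func(df):
    # Same logic as in the question: keep rows where key == 1, then count.
    df = df[df['key'] == 1]
    return df['key'].value_counts()

# pool.map(func, files) returns the same shape: a list of Series.
results = [func(df) for df in frames]

# Sum counts that share the same index value across all files.
total = pd.concat(results).groupby(level=0).sum()
print(total)  # key 1 appears 3 times across both "files"
```

pd.concat followed by groupby(level=0).sum() aligns the Series by their index, so it still works if some files contain key values others do not.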
Second, as you mentioned, there is roughly 60 GB of data in total. Loading whole files into DataFrames (pandas can need several times the on-disk size in RAM while parsing) is likely more than your local machine can handle, which would explain the dying kernel.
I believe you need a powerful cluster for this task (no more pandas), so you may want to consider something like Spark:
df = spark.read.csv(your_file_list, header=True)
df = df.filter(df.key == 1)
df.head(5)  # you can use df.collect() if the result set is not too large