How to speed up reading and writing CSV files with multi-threading - Python
I have 30 CSV files. Each file has 200,000 rows and 10 columns. I want to read these files and do some processing. Below is the code without multi-threading:
import os
import time

csv_dir = './csv'
csv_save_dir = './save_csv'
csv_files = os.listdir(csv_dir)

if __name__ == '__main__':
    if not os.path.exists(csv_save_dir):
        os.makedirs(csv_save_dir)
    start = time.perf_counter()
    for csv_file in csv_files:
        csv_file_path = os.path.join(csv_dir, csv_file)
        with open(csv_file_path, 'r') as f:
            lines = f.readlines()
        csv_file_save_path = os.path.join(csv_save_dir, '1_' + csv_file)
        with open(csv_file_save_path, 'w') as f:
            f.writelines(lines[:20])
        print(f'CSV File saved...')
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start, 2)} second(s)')
The elapsed time of the above code is about 7 seconds. I then modified the code to use multi-threading. The code is as follows:
import os
import time
import concurrent.futures

csv_dir = './csv'
csv_save_dir = './save_csv'
csv_files = os.listdir(csv_dir)

def read_and_write_csv(csv_file):
    csv_file_path = os.path.join(csv_dir, csv_file)
    with open(csv_file_path, 'r') as f:
        lines = f.readlines()
    csv_file_save_path = os.path.join(csv_save_dir, '1_' + csv_file)
    with open(csv_file_save_path, 'w') as f:
        f.writelines(lines[:20])
    print(f'CSV File saved...')

if __name__ == '__main__':
    if not os.path.exists(csv_save_dir):
        os.makedirs(csv_save_dir)
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(read_and_write_csv, csv_files)
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start, 2)} second(s)')
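One detail worth noting about the threaded version: `executor.map` returns a lazy iterator, and an exception raised inside a worker only resurfaces when the results are consumed, so a failure inside `read_and_write_csv` could pass silently here. A minimal sketch of that behaviour, using a trivial stand-in task rather than the CSV code:

```python
import concurrent.futures

def work(x):
    # trivial stand-in for a worker like read_and_write_csv
    if x < 0:
        raise ValueError(f'bad input: {x}')
    return x * 2

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # consuming the iterator (e.g. with list()) re-raises any worker exception
    results = list(executor.map(work, [1, 2, 3]))

print(results)  # results arrive in input order: [2, 4, 6]
```

If `[-1, 2, 3]` were passed instead, the `ValueError` would be raised at the `list(...)` call, not inside the pool.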
I don't agree with the comments. Python will happily use multiple CPU cores if you have them, executing threads on separate cores.
What I think is the issue here is your test. If you added the "do some process" you mentioned to your thread workers, I think you may find the multi-threaded version to be faster. Right now your test merely shows that it takes about 7 seconds to read/write the CSV files, which is I/O-bound and does not take advantage of the CPUs.
If your "do some process" is non-trivial, I'd use multi-threading differently. Right now, you are having each thread do:
read csv file
process csv file
save csv file
This way, you are getting thread contention during the read and save steps, slowing things down.
For a non-trivial "process" step, I'd do this (pseudo-code):
def process_csv(line):
    <perform your processing on a single line>

<main>:
    for csv_file in csv_files:
        <read lines from csv>
        with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
            executor.map(process_csv, lines)
        <write lines out to csv>
Since you're locking on read/write anyway, here at least the work-per-line is being spread across cores. And you're not trying to read all the CSVs into memory simultaneously. Pick a `max_workers` value appropriate for the number of cores in your system.
If "do some process" is trivial, my suggestion is probably pointless.
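Fleshing the pseudo-code out, here is a runnable sketch of the per-line approach; the uppercasing transform is just a hypothetical stand-in for the real "process" step:

```python
import concurrent.futures

def process_csv(line):
    # hypothetical per-line work; replace with your real processing
    return line.upper()

def process_lines(lines, max_workers=8):
    # spread the per-line work across a thread pool;
    # executor.map yields results in the original input order
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_csv, lines))

# demo on in-memory lines standing in for one file's contents
lines = ['a,b,c\n', 'd,e,f\n']
print(process_lines(lines))  # ['A,B,C\n', 'D,E,F\n']
```

In the real loop you would call `process_lines` between the read and write steps for each file.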