How to speed up reading and writing CSV files with multi-threading - Python
I have 30 CSV files. Each file has 200,000 rows and 10 columns. I want to read these files and do some processing. Below is the code without multi-threading:
import os
import time

csv_dir = './csv'
csv_save_dir = './save_csv'
csv_files = os.listdir(csv_dir)

if __name__ == '__main__':
    if not os.path.exists(csv_save_dir):
        os.makedirs(csv_save_dir)
    start = time.perf_counter()
    for csv_file in csv_files:
        csv_file_path = os.path.join(csv_dir, csv_file)
        with open(csv_file_path, 'r') as f:
            lines = f.readlines()
        csv_file_save_path = os.path.join(csv_save_dir, '1_' + csv_file)
        with open(csv_file_save_path, 'w') as f:
            f.writelines(lines[:20])
        print(f'CSV File saved...')
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start, 2)} second(s)')
The elapsed time of the above code is about 7 seconds. I then modified the code to use multi-threading. The code is as follows:
import os
import time
import concurrent.futures

csv_dir = './csv'
csv_save_dir = './save_csv'
csv_files = os.listdir(csv_dir)

def read_and_write_csv(csv_file):
    csv_file_path = os.path.join(csv_dir, csv_file)
    with open(csv_file_path, 'r') as f:
        lines = f.readlines()
    csv_file_save_path = os.path.join(csv_save_dir, '1_' + csv_file)
    with open(csv_file_save_path, 'w') as f:
        f.writelines(lines[:20])
    print(f'CSV File saved...')

if __name__ == '__main__':
    if not os.path.exists(csv_save_dir):
        os.makedirs(csv_save_dir)
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(read_and_write_csv, csv_files)
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start, 2)} second(s)')
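One detail worth noting about the threaded version: `executor.map` returns a lazy iterator, and an exception raised inside a worker only resurfaces when the results are consumed, so a failure inside `read_and_write_csv` could pass silently here. A minimal sketch of that behaviour, using a trivial stand-in task rather than the CSV code:

```python
import concurrent.futures

def work(x):
    # trivial stand-in for a worker like read_and_write_csv
    if x < 0:
        raise ValueError(f'bad input: {x}')
    return x * 2

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # consuming the iterator (e.g. with list()) re-raises any worker exception
    results = list(executor.map(work, [1, 2, 3]))

print(results)  # results arrive in input order: [2, 4, 6]
```

If `[-1, 2, 3]` were passed instead, the `ValueError` would be raised at the `list(...)` call, not inside the pool.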
I don't agree with the comments. Python will happily use multiple CPU cores if you have them, executing threads on separate cores.
What I think is the issue here is your test. If you added the "do some process" you mentioned to your thread workers, I think you may find the multi-threaded version to be faster. Right now your test merely shows that it takes about 7 seconds to read/write the CSV files, which is I/O-bound and does not take advantage of the CPUs.
If your "do some process" is non-trivial, I'd use multi-threading differently. Right now, you are having each thread do:
read csv file
process csv file
save csv file
This way, you are getting thread contention during the read and save steps, slowing things down.
For a non-trivial "process" step, I'd do this (pseudo-code):
def process_csv(line):
    <perform your processing on a single line>

<main>:
    for csv_file in csv_files:
        <read lines from csv>
        with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
            executor.map(process_csv, lines)
        <write lines out to csv>
Since you're locking on read/write anyway, here at least the work-per-line is being spread across cores. And you're not trying to read all the CSVs into memory simultaneously. Pick a `max_workers` value appropriate for the number of cores in your system.
If "do some process" is trivial, my suggestion is probably pointless.
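Fleshing the pseudo-code out, here is a runnable sketch of the per-line approach; the uppercasing transform is just a hypothetical stand-in for the real "process" step:

```python
import concurrent.futures

def process_csv(line):
    # hypothetical per-line work; replace with your real processing
    return line.upper()

def process_lines(lines, max_workers=8):
    # spread the per-line work across a thread pool;
    # executor.map yields results in the original input order
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_csv, lines))

# demo on in-memory lines standing in for one file's contents
lines = ['a,b,c\n', 'd,e,f\n']
print(process_lines(lines))  # ['A,B,C\n', 'D,E,F\n']
```

In the real loop you would call `process_lines` between the read and write steps for each file.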