
How can I speed up this python script to read and process a csv file?

I am trying to process a relatively large (about 100k lines) csv file in python. This is what my code looks like:

#!/usr/bin/env python

import sys
reload(sys)
sys.setdefaultencoding("utf8")
import csv
import os

csvFileName = sys.argv[1]


with open(csvFileName, 'r') as inputFile:
    parsedFile = csv.DictReader(inputFile, delimiter=',')
    totalCount = 0
    for row in parsedFile:
        target = row['new']
        source = row['old']
        systemLine = "some_curl_command {source}, {target}".format(source = source, target = target)
        os.system(systemLine)
        totalCount += 1
        print "\nProcessed number: " + str(totalCount)

I'm not sure how to optimize this script. Should I use something besides DictReader?

I have to use Python 2.7, and cannot upgrade to Python 3.

  1. If you want to avoid multiprocessing, it is possible to split your long csv file into a few smaller csvs and run them simultaneously. Like this:

     $ python your_script.py 1.csv &
     $ python your_script.py 2.csv &

The ampersand stands for background execution in Linux environments. More details here. I don't have enough knowledge about anything similar in Windows, but it's possible to open a few cmd windows, lol.

Anyway, it's much better to stick with multiprocessing, of course.
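A minimal sketch of that route, assuming each row only needs one independent command run for it; the pool size and the run_row helper are illustrative, not part of the original script:

     import csv
     import os
     import sys
     from multiprocessing import Pool

     def run_row(row):
         # Build and run one command per row; each worker handles rows independently.
         systemLine = "some_curl_command {source}, {target}".format(
             source=row['old'], target=row['new'])
         os.system(systemLine)

     if __name__ == '__main__':
         with open(sys.argv[1], 'r') as inputFile:
             rows = list(csv.DictReader(inputFile, delimiter=','))
         pool = Pool(processes=4)  # tune to your cores / bandwidth
         pool.map(run_row, rows)   # fan the rows out across worker processes
         pool.close()
         pool.join()

Since the real work here is network-bound curl calls, the pool size is limited by bandwidth rather than CPU cores, so it's worth experimenting with larger values.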

  1. What about using requests instead of curl?

     import requests

     response = requests.get(source_url)
     html = response.content
     with open(target, "w") as file:
         file.write(html)

Here's the doc.

  1. Avoid print statements; in a long run they're slow as hell. They're fine for development and debugging, but when you decide to start the final execution of your script, you can remove them and check the count of processed files directly in the target folder.
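If you still want some feedback during the final run, one compromise (my own suggestion, not part of the answer above) is to print only every 1000th row, so the per-row console I/O drops out of the hot loop; the interval is an assumption:

     import csv
     import sys

     totalCount = 0
     with open(sys.argv[1], 'r') as inputFile:
         for row in csv.DictReader(inputFile, delimiter=','):
             # ... process the row here ...
             totalCount += 1
             if totalCount % 1000 == 0:  # print rarely instead of on every row
                 print "Processed: " + str(totalCount)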

Running

subprocess.Popen(systemLine)

instead of

os.system(systemLine)

should speed things up. Please note that systemLine has to be a list of strings, e.g. ['some_curl_command', 'source', 'target'], in order to work. If you want to limit the number of concurrent commands, have a look at that.
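As a rough sketch of how that might look in the original loop — the batch-of-8 throttle is a simplistic stand-in for the linked approach, and the argument order of some_curl_command is assumed:

     import csv
     import subprocess
     import sys

     MAX_CONCURRENT = 8
     procs = []

     with open(sys.argv[1], 'r') as inputFile:
         for row in csv.DictReader(inputFile, delimiter=','):
             # Popen returns immediately, so the commands run in parallel.
             procs.append(subprocess.Popen(
                 ['some_curl_command', row['old'], row['new']]))
             if len(procs) >= MAX_CONCURRENT:
                 # Crude throttle: wait for the current batch before starting more.
                 for p in procs:
                     p.wait()
                 procs = []

     # Wait for any leftover processes.
     for p in procs:
         p.wait()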
