How can I speed up this python script to read and process a csv file?

I am trying to process a relatively large (about 100k lines) csv file in python. This is what my code looks like:

#!/usr/bin/env python

import sys
reload(sys)
sys.setdefaultencoding("utf8")
import csv
import os

csvFileName = sys.argv[1]


with open(csvFileName, 'r') as inputFile:
    parsedFile = csv.DictReader(inputFile, delimiter=',')
    totalCount = 0
    for row in parsedFile:
        target = row['new']
        source = row['old']
        systemLine = "some_curl_command {source}, {target}".format(source = source, target = target)
        os.system(systemLine)
        totalCount += 1
        print "\nProcessed number: " + str(totalCount)

I'm not sure how to optimize this script. Should I use something besides DictReader?

I have to use Python 2.7, and cannot upgrade to Python 3.

  1. If you want to avoid multiprocessing, you can split the long csv file into a few smaller csv files and run the script on each of them simultaneously, like this:

     $ python your_script.py 1.csv &
     $ python your_script.py 2.csv &

The ampersand runs the command in the background in Linux shells. I don't know enough about an equivalent on Windows, but it's possible to open a few cmd windows, lol.

Anyway, it's much better to stick with multiprocessing, of course.

  2. What about using requests instead of curl?

     import requests

     response = requests.get(source_url)
     html = response.content
     with open(target, "wb") as file:
         file.write(html)

See the requests documentation for details.

  3. Avoid print statements; over a long run they slow things down a lot. They're fine for development and debugging, but for the final run of your script you can remove them and check the count of processed files directly in the target folder.
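
The multiprocessing route recommended in point 1 could look roughly like the sketch below. It rests on a few assumptions: it swaps in the thread-backed pool from multiprocessing.dummy (same interface as multiprocessing.Pool, chosen here because each worker only waits on an external command), build_command and process_csv are hypothetical names, and some_curl_command is the placeholder from the question. It is written to run on Python 2.7.

```python
from __future__ import print_function

import csv
import subprocess
import sys
from multiprocessing.dummy import Pool  # thread-backed, same API as multiprocessing.Pool


def build_command(row):
    # Hypothetical: replace with the real curl invocation from the question.
    return ["some_curl_command", row["old"], row["new"]]


def process_csv(csv_path, workers=8, build=build_command):
    """Run one external command per CSV row, up to `workers` at a time."""
    with open(csv_path, "r") as input_file:
        rows = list(csv.DictReader(input_file, delimiter=","))
    pool = Pool(workers)
    try:
        # subprocess.call blocks until the command exits and returns its exit
        # status, so the result is one status per row, in row order.
        return pool.map(lambda row: subprocess.call(build(row)), rows)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__" and len(sys.argv) > 1:
    exit_codes = process_csv(sys.argv[1])
    print("Processed number: %d" % len(exit_codes))
```

The workers value of 8 is an arbitrary starting point; the right number depends on how much load the remote server and your connection can take.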

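Running

subprocess.Popen(systemLine)

instead of

os.system(systemLine)

should speed things up, because Popen starts the command and returns immediately instead of waiting for it to finish, so the commands run concurrently. Please note that systemLine has to be a list of strings, e.g. ['some_curl_command', 'source', 'target'], in order to work. Limiting the number of concurrent commands takes some extra bookkeeping, since Popen itself never waits.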
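
One way to cap the number of concurrent commands is sketched below. This is only an illustration: run_limited and max_running are hypothetical names, and it uses the simplest possible policy, waiting on the oldest running process whenever the limit is reached. It runs on Python 2.7.

```python
import subprocess


def run_limited(commands, max_running=4):
    """Start each command with Popen, keeping at most max_running alive.

    `commands` is an iterable of argument lists such as
    ['some_curl_command', 'source', 'target'].
    Returns the exit statuses in the order the commands were started.
    """
    running = []
    exit_codes = []
    for command in commands:
        if len(running) >= max_running:
            # Block on the oldest process to free a slot.
            exit_codes.append(running.pop(0).wait())
        running.append(subprocess.Popen(command))
    # Wait for whatever is still running at the end.
    for process in running:
        exit_codes.append(process.wait())
    return exit_codes
```

Waiting on the oldest process keeps the code short, but a single slow command can leave other slots idle for a while; a pool of worker threads avoids that at the cost of a little more code.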
