I am trying to process a relatively large (about 100k lines) CSV file in Python. This is what my code looks like:
#!/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import csv
import os

csvFileName = sys.argv[1]

with open(csvFileName, 'r') as inputFile:
    parsedFile = csv.DictReader(inputFile, delimiter=',')
    totalCount = 0
    for row in parsedFile:
        target = row['new']
        source = row['old']
        systemLine = "some_curl_command {source}, {target}".format(source = source, target = target)
        os.system(systemLine)
        totalCount += 1

print "\nProcessed number: " + str(totalCount)
I'm not sure how to optimize this script. Should I use something besides DictReader?
I have to use Python 2.7, and cannot upgrade to Python 3.
If you want to avoid multiprocessing, you can split your long CSV file into a few smaller CSVs and process them simultaneously, like this:
$ python your_script.py 1.csv &
$ python your_script.py 2.csv &
The trailing ampersand tells a Linux shell to run the command in the background, so the second script starts without waiting for the first one to finish. I don't know of an equivalent on Windows, but it's possible to open a few cmd windows, lol.
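The splitting step can itself be scripted. Here is a minimal sketch, assuming the first line of the file is the header and that the whole file fits in memory (about 100k rows should). The names `open_csv`, `split_csv` and the `part_` prefix are made up for this example and are not part of the original script.

```python
import csv
import sys


def open_csv(path, mode):
    # Python 2's csv module expects binary mode; Python 3 expects
    # text mode with newline=''. This helper hides that difference.
    if sys.version_info[0] < 3:
        return open(path, mode + 'b')
    return open(path, mode, newline='')


def split_csv(path, parts, prefix='part_'):
    """Write `parts` files (prefix0.csv, prefix1.csv, ...) that each
    repeat the header row and hold a share of the data rows."""
    with open_csv(path, 'r') as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)

    names = []
    for i in range(parts):
        name = '{0}{1}.csv'.format(prefix, i)
        with open_csv(name, 'w') as dst:
            writer = csv.writer(dst)
            writer.writerow(header)
            # Round-robin slice: rows i, i + parts, i + 2 * parts, ...
            writer.writerows(rows[i::parts])
        names.append(name)
    return names
```

Calling `split_csv('big.csv', 4)` would produce `part_0.csv` to `part_3.csv`, each of which can then be passed to the script in the background as shown above.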
Anyway, it's much better to stick with multiprocessing, of course.
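For reference, a pool-based version might look roughly like the sketch below. It uses `multiprocessing.pool.ThreadPool`, the thread-backed pool that ships in the `multiprocessing` package, rather than a process pool: each worker only waits for an external command, so threads should be enough here. `build_command`, `process_rows` and the worker count of 8 are assumptions made for illustration, and `some_curl_command` is the placeholder from the question.

```python
import subprocess
from multiprocessing.pool import ThreadPool


def build_command(row):
    # Hypothetical: stands in for "some_curl_command" from the question.
    return ['some_curl_command', row['old'], row['new']]


def process_rows(rows, workers=8, make_command=build_command):
    """Run one external command per row, at most `workers` at a time.

    Returns the exit codes in the same order as the rows.
    """
    pool = ThreadPool(workers)
    try:
        # subprocess.call blocks until the command exits, so each
        # worker thread handles one command at a time.
        return pool.map(lambda row: subprocess.call(make_command(row)), rows)
    finally:
        # ThreadPool is not a context manager on Python 2.7,
        # so close and join it explicitly.
        pool.close()
        pool.join()
```

In the original script this would replace the `for` loop: pass the `csv.DictReader` object as `rows` and use `len()` of the returned list as the processed count.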
What about using requests instead of curl? That way you don't have to start a new process for every row.
import requests

response = requests.get(source_url)
html = response.content
# response.content is raw bytes, so open the target in binary mode
with open(target, "wb") as f:
    f.write(html)
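Running subprocess.Popen(systemLine) instead of os.system(systemLine) should speed things up, because Popen starts the command and returns immediately, whereas os.system blocks until the command has finished. Please note that systemLine then has to be a list of strings, e.g. ['some_curl_command', 'source', 'target'], in order to work. If you want to limit the number of concurrent commands, you have to keep track of the running processes yourself and wait for one to finish before starting the next.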
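One way to cap the number of concurrent commands is sketched below. It assumes each command is already a list of strings; `run_limited`, `max_running` and `poll_interval` are names and defaults picked for this example only.

```python
import subprocess
import time


def run_limited(commands, max_running=4, poll_interval=0.05):
    """Start every command with Popen while keeping at most
    `max_running` of them alive at once.

    Returns the exit codes in the same order as `commands`.
    """
    commands = list(commands)
    exit_codes = [None] * len(commands)
    running = []  # (index, Popen) pairs that have not finished yet
    next_index = 0

    while next_index < len(commands) or running:
        # Fill any free slots with new processes.
        while next_index < len(commands) and len(running) < max_running:
            proc = subprocess.Popen(commands[next_index])
            running.append((next_index, proc))
            next_index += 1

        # poll() returns None while a process is still running.
        still_running = []
        for index, proc in running:
            code = proc.poll()
            if code is None:
                still_running.append((index, proc))
            else:
                exit_codes[index] = code
        running = still_running

        if running:
            time.sleep(poll_interval)

    return exit_codes
```

A generator such as `(['some_curl_command', row['old'], row['new']] for row in parsedFile)` could be passed as `commands`; the function turns it into a list before starting anything.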