
Optimize processing of a huge CSV file

I know this question may be too broad, but I need to find a way to optimize the processing of a CSV file that contains 10,000 rows.

Each row must be parsed; for every row I need to call a Google API and do some calculations, and then I need to write a new CSV file with the additional information.

Right now, I am using PHP, and the processing takes around half an hour.

Is there a way to optimize this? I thought about using NodeJS to parallelize the processing of the rows.

You can use curl_multi_select (together with the rest of PHP's curl_multi_* functions) to parallelize the Google API requests. Load the input into a queue, run the queries in parallel, and as each result finishes, write its output and load more from the queue. Something like the TCP sliding window algorithm; a sketch follows below.
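For illustration, here is a minimal sketch of that sliding window built on the curl_multi_* functions. The endpoint URL, the file names, and the per-row calculation are placeholders, not the asker's actual API call:

```php
<?php
// Sliding-window worker: keep at most $window requests in flight,
// write each finished row, and refill the window from the queue.
$rows     = array_map('str_getcsv', file('input.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
$window   = 10;                // tune to what the API (and its rate limits) tolerates
$out      = fopen('output.csv', 'w');
$mh       = curl_multi_init();
$next     = 0;                 // index of the next row to enqueue
$inFlight = 0;

$enqueue = function (int $i) use ($mh, $rows, &$inFlight) {
    // Placeholder endpoint: substitute the real Google API call for your data.
    $ch = curl_init('https://example.googleapis.com/lookup?q=' . urlencode($rows[$i][0]));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PRIVATE, (string) $i);      // remember which row this is
    curl_multi_add_handle($mh, $ch);
    $inFlight++;
};

while ($next < $window && $next < count($rows)) {        // fill the initial window
    $enqueue($next++);
}

while ($inFlight > 0) {
    curl_multi_exec($mh, $running);
    if (curl_multi_select($mh) === -1) {
        usleep(1000);                                    // avoid busy-looping
    }
    while ($done = curl_multi_info_read($mh)) {          // harvest finished requests
        $ch   = $done['handle'];
        $i    = (int) curl_getinfo($ch, CURLINFO_PRIVATE);
        $body = curl_multi_getcontent($ch);
        // ... run the per-row calculations on $body here ...
        fputcsv($out, array_merge($rows[$i], [$body]));
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        $inFlight--;
        if ($next < count($rows)) {                      // slide the window forward
            $enqueue($next++);
        }
    }
}

curl_multi_close($mh);
fclose($out);
```

CURLOPT_PRIVATE tags each handle with its row index, so results can be matched back to their rows no matter what order the requests complete in.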

Alternatively, you can load all the data into an (SQLite) database (10,000 rows is not much) and then run the calculations in parallel; the import step is sketched below. The database will be easier to implement than the sliding window.
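A minimal sketch of that import, assuming PDO with the sqlite driver; the table layout and file names are made up for illustration:

```php
<?php
// Sketch of the import step: load the CSV into SQLite once, so that several
// worker processes can each claim a disjoint slice of rows afterwards.
$db = new PDO('sqlite:work.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE IF NOT EXISTS rows (
    id     INTEGER PRIMARY KEY,
    input  TEXT NOT NULL,   -- the raw CSV row, stored as JSON
    result TEXT             -- NULL until a worker has processed the row
)');

$insert = $db->prepare('INSERT INTO rows (input) VALUES (?)');
$db->beginTransaction();                   // one transaction makes bulk insert fast
$fh = fopen('input.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $insert->execute([json_encode($row)]);
}
fclose($fh);
$db->commit();
```

Each of N workers could then be started as, say, `php worker.php 2 4` and process only the rows matching `WHERE result IS NULL AND id % 4 = 2`, writing results back with an UPDATE; a final single pass exports the table to the output CSV.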

I don't think NodeJS would be much faster. Certainly not enough to be worth rewriting the existing code you already have.

You can profile the code by checking how long it takes to read the 10K rows and write them back with some random extra columns (see the sketch below). This will give you a sense of how long the CSV reading and writing alone take for 10K rows. It shouldn't take long.
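Something like this quick PHP baseline would isolate the CSV I/O cost; the file names are placeholders:

```php
<?php
// Baseline: read input.csv, append a dummy column to every row, and write
// output.csv, with the API calls left out entirely. If this finishes in a
// few seconds (it should for 10K rows), the bottleneck is the API.
$start = microtime(true);

$in  = fopen('input.csv', 'r');
$out = fopen('output.csv', 'w');
while (($row = fgetcsv($in)) !== false) {
    $row[] = 'extra-' . mt_rand();   // stand-in for the real computed columns
    fputcsv($out, $row);
}
fclose($in);
fclose($out);

printf("CSV read/write took %.2f s\n", microtime(true) - $start);
```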

The Google API calls might be the culprit. If you know Node.js, it is a good option, but if that is too much of a pain, you can use PHP's curl to send multiple requests at once without waiting for the response to each request. This might help speed up the process. You can refer to this site for more info: http://bytes.schibsted.com/php-perform-requests-in-parallel/ . A higher-level alternative is sketched below.
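Besides raw curl, one higher-level option (my suggestion, not something the answer names) is the Guzzle HTTP client, whose Pool caps concurrency in a few lines. The endpoint and file names here are placeholders:

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$rows   = array_map('str_getcsv', file('input.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
$client = new Client();

// Lazily yield one request per row; the key is the row index.
$requests = function () use ($rows) {
    foreach ($rows as $i => $row) {
        // Placeholder endpoint: substitute the real Google API URL.
        yield $i => new Request('GET', 'https://example.googleapis.com/lookup?q=' . urlencode($row[0]));
    }
};

$results = [];
$pool = new Pool($client, $requests(), [
    'concurrency' => 10,                       // requests in flight at once
    'fulfilled'   => function ($response, $i) use (&$results) {
        $results[$i] = (string) $response->getBody();
    },
    'rejected'    => function ($reason, $i) use (&$results) {
        $results[$i] = null;                   // record the failure; retry later
    },
]);
$pool->promise()->wait();                      // block until the pool drains

// Write everything back out with the API result appended.
$out = fopen('output.csv', 'w');
foreach ($rows as $i => $row) {
    fputcsv($out, array_merge($row, [$results[$i] ?? '']));
}
fclose($out);
```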

10,000 rows should be no problem, but when opening the file in Python 3.6, make sure you use readlines and read it all at once. Using the csv reader should also help with any separator issues and quote characters such as '"'. I've been reading 1.3 million rows and it's not an issue; mine takes about 6-8 minutes to process, so yours should be on the order of a few seconds.

Are you using a machine with enough memory? If you are using a Raspberry Pi, a small virtual machine, or a really old laptop, I could imagine that greatly hampering your processing time. Otherwise, you should have no issues at all with Python.
