
How can I optimize my code to process faster?

I have some performance issues with the code that I wrote. The objective of the code is to compare 2 csv files (with over 900k rows in one, and 50k ~ 80k rows in the other).

The goal is to compare csv1 and csv2, and write the matching data to a 3rd csv.

The data I have looks like this:

CSV1:

address,name,order_no
add1,John,d009
add2,Smith,d019
add3,Mary,d890
.....(900k more rows)

CSV2:

address,hub_id
add3,x345
add4,x310
add1,a109
....(50k ~ 80k more rows)

The expected output:

CSV3:

order_no,hub_id
d890,x345
d009,a109
.....(etc)

The code I'm working on right now (albeit simple) actually works, but the whole process of comparing and writing takes a very long time to finish.

Any pointers would be much appreciated. I might have overlooked some Python function that could be used for comparing large data sets, since I have just started learning.

import csv
import time
start_time = time.time()

with open('csv1.csv', newline='', encoding='Latin-1') as masterfile:
    reader = csv.DictReader(masterfile)
    for row in reader:
        with open('csv2.csv', newline='', encoding='Latin-1') as list1:
            reader2 = csv.DictReader(list1)
            for row2 in reader2:
                if row2['address'] == row['address']:
                    with open('csv3.csv', 'a') as corder:
                        print(row['order_no'] + ', ' + row2['hub_id'], file=corder)

print("--- %s seconds ---" % (time.time() - start_time))

What your algorithm is currently doing:

  1. Load a row of the big file.
  2. Open the smaller file.
  3. Do a linear search through the small file, reading it from disk.
  4. Open the output file and write to it.
  5. Rinse and repeat.

All these steps are done 900k+ times.

Step #2, opening the smaller file, should only ever be done once. Opening a file and loading it from disk is an expensive operation. Just by loading it once at the beginning and doing the linear search (step #3) in memory, you would see a great improvement.

The same goes for step #4: opening the output file should only be done once. The system flushes the file to disk every time you close it, which is very wasteful. If you keep the file open, output data are buffered until there is enough to write a full block to the disk, which is a much faster way to accomplish that.
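To illustrate, here is a minimal sketch of just those two changes applied to your original loop: the small file is read into memory once, and the input and output files are each opened once. The lookup itself is still a linear search; the dict-based fix comes next.

import csv

# Read the small file into memory once, before the main loop.
with open('csv2.csv', newline='', encoding='Latin-1') as list1:
    small_rows = list(csv.DictReader(list1))

# Open the big input file and the output file once each.
with open('csv1.csv', newline='', encoding='Latin-1') as masterfile, open('csv3.csv', 'w') as corder:
    corder.write('order_no,hub_id\n')
    for row in csv.DictReader(masterfile):
        # Still a linear search, but now it runs entirely in memory.
        for row2 in small_rows:
            if row2['address'] == row['address']:
                corder.write(row['order_no'] + ',' + row2['hub_id'] + '\n')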

Step #3 can be optimized a lot by using the correct data structure. The right tool for this kind of lookup is a hash table. Hash tables are ubiquitous because they make lookup a constant-time operation (unlike linear search, which scales linearly with the size of your input). In Python, hash tables are implemented by the dict class. By creating a dict with address as the key, you can reduce your processing time to a multiple of 900k + 80k rather than one of 900k * 80k. Look up algorithmic complexity to learn more. I particularly recommend Steve Skiena's "The Algorithm Design Manual".
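As a toy illustration (the addresses below are made up, not taken from your files), a dict keyed by address answers each query in constant time, whereas scanning a list of rows costs time proportional to its length:

# Toy data for illustration only.
rows = [('add1', 'a109'), ('add3', 'x345'), ('add4', 'x310')]

# Linear search: the cost grows with the number of rows.
match = next((hub for addr, hub in rows if addr == 'add3'), None)

# Hash table (dict): one constant-time lookup, however many rows there are.
lookup = dict(rows)
match = lookup.get('add3')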

One final step is to find the intersection of the addresses in each file. There are a few options available. You can convert both files into dicts and do a set-like intersection of the keys, or you can load one file into a dict and test the other one against it line by line. I highly recommend the latter, with the smaller file as the one you load into a dict. From an algorithmic perspective, having 10 times fewer elements means you reduce the probability of hash collisions. This is also the cheapest approach, since it fails fast on irrelevant lines of the larger file, without recording them. From a practical standpoint, you may not even have the option of converting the larger file straightforwardly into a dictionary if, as I suspect, it has multiple rows with the same address.

Here is an implementation of what I've been talking about:

import csv

# Load the smaller file once, into a dict keyed by address.
with open('csv2.csv', newline='', encoding='Latin-1') as lookupfile:
    lookup = dict(csv.reader(lookupfile))

with open('csv1.csv', newline='', encoding='Latin-1') as masterfile, open('csv3.csv', 'w') as corder:
    reader = csv.reader(masterfile)
    next(reader)  # skip the header row of csv1
    corder.write('order_no,hub_id\n')
    for address, name, order_no in reader:
        hub_id = lookup.get(address)
        if hub_id is not None:
            corder.write(f'{order_no},{hub_id}\n')

The expression dict(csv.reader(lookupfile)) will fail if any of the rows are not exactly two elements long. For example, blank lines will crash it. This is because the constructor of dict expects an iterable of two-element sequences to initialize the key-value mappings.
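If malformed rows do turn up, one possible guard (a sketch, not something your data necessarily needs) is to keep only the rows that have exactly two fields:

with open('csv2.csv', newline='', encoding='Latin-1') as lookupfile:
    # Drop blank or malformed lines; keep only well-formed two-column rows.
    lookup = dict(row for row in csv.reader(lookupfile) if len(row) == 2)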

As a minor optimization, I've not used csv.DictReader, as that requires extra processing for each line. Furthermore, I've removed the csv module from the output entirely, since you can do the job much faster without adding layers of wrappers. If your files are as neatly formatted as you show, you may get a tiny performance boost from splitting them on ',' yourself, rather than using csv.
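For completeness, here is a sketch of that manual-split variant. It assumes no field ever contains a comma inside quotes (the case the csv module exists to handle) and reuses the lookup dict built above:

with open('csv1.csv', encoding='Latin-1') as masterfile, open('csv3.csv', 'w') as corder:
    next(masterfile)  # skip the header line of csv1
    corder.write('order_no,hub_id\n')
    for line in masterfile:
        # Split on commas directly; only safe for neatly formatted files.
        address, name, order_no = line.rstrip('\n').split(',')
        hub_id = lookup.get(address)
        if hub_id is not None:
            corder.write(f'{order_no},{hub_id}\n')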

It takes so long because:

  • the complexity is quadratic, roughly 900k * 80k comparisons; never perform linear searches on big data like this
  • the constant file opening, reading and writing adds to the toll

You can do much better by creating 2 dictionaries with the address as key and the full row as value.

Then perform an intersection of the keys and write the result, picking data from each dictionary as required.

The following code was tested on your sample data:

import csv

# Build one dict per file, keyed by address, with the full row as the value.
with open('csv1.csv', newline='', encoding='Latin-1') as f:
    reader = csv.DictReader(f)
    master_dict = {row["address"]: row for row in reader}
with open('csv2.csv', newline='', encoding='Latin-1') as f:
    reader = csv.DictReader(f)
    secondary_dict = {row["address"]: row for row in reader}

# key intersection
common_keys = set(master_dict) & set(secondary_dict)

with open("result.csv", "w", newline='', encoding='Latin-1') as f:
    writer = csv.writer(f)
    writer.writerow(['order_no', "hub_id"])
    writer.writerows([master_dict[x]['order_no'], secondary_dict[x]["hub_id"]] for x in common_keys)

the result is:

order_no,hub_id
d009,a109
d890,x345
