
Counting Occurrences of Zip Codes in Big Data Set w/Python

I'm a Python newbie looking to find the 100 most frequently occurring zip codes across several .csv files (6+). The data set contains more than 3 million zip codes, and I'm looking for a way to pull out only the 100 most common. Below is a code sample inspired by another post, although I'm trying to count across several .csv files. Thanks in advance!

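import csv
import collections

# Named zip_counts to avoid shadowing the built-in zip()
zip_counts = collections.Counter()

# This handles a single file; the goal is to count across
# zipcodefile1.csv, zipcodefile2.csv, zipcodefile3.csv and more
with open('zipcodefile1.csv') as input_file:
    for row in csv.reader(input_file, delimiter=';'):
        zip_counts[row[1]] += 1

print(zip_counts.most_common(100))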

I'd suggest using Python's generators here, as they will be nice and efficient. First, suppose we have two files:

zc1.txt:

something;00001
another;00002
test;00003

and zc2.txt:

foo;00001
bar;00001
quuz;00003

Now let's write a function that takes several filenames and iterates through the lines in all of the files, returning only the zip codes:

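import csv

def iter_zipcodes(paths):
    """Yield the zip code (second column) of every row in every file."""
    for path in paths:
        # newline='' is what the csv module recommends for files given to csv.reader
        with open(path, newline='') as fh:
            for row in csv.reader(fh, delimiter=';'):
                yield row[1]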

Note that we write yield row[1]. The yield keyword makes the function a generator, which produces its values lazily: each zip code is read only when the consumer asks for the next one.
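To make the lazy behaviour concrete, here is a minimal sketch. The names iter_values and produced are hypothetical stand-ins introduced only for this illustration; they are not part of the answer's code. The list records which items the generator has actually produced so far.

```python
produced = []  # hypothetical log of items the generator has produced so far

def iter_values(items):
    """Hypothetical stand-in for iter_zipcodes: yields items one at a time."""
    for item in items:
        produced.append(item)
        yield item

gen = iter_values(['00001', '00002', '00003'])

# Calling the function only creates the generator; its body has not run yet.
produced_before = list(produced)   # still empty at this point

# Each next() call runs the body just far enough to produce one value.
first = next(gen)
print(produced_before, first, produced)
```

After the single next() call only the first item has been produced, which is why a consumer such as collections.Counter can work through a very large input without the whole input being held in memory.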

We can test it out as follows:

>>> list(iter_zipcodes(['zc1.txt', 'zc2.txt']))
['00001', '00002', '00003', '00001', '00001', '00003']

So we see that the generator simply spits out the zip codes in each file, in order. Now let's count them:

>>> import collections
>>> zipcodes = iter_zipcodes(['zc1.txt', 'zc2.txt'])
>>> counts = collections.Counter(zipcodes)
>>> counts
Counter({'00001': 3, '00003': 2, '00002': 1})

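Looks like it worked. This approach is memory-efficient because the generator reads only one line at a time; when one file is exhausted, it moves on to the next. Only the Counter, with one entry per distinct zip code, is kept in memory. To get the 100 most common zip codes, as asked in the question, call counts.most_common(100).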
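Putting the pieces together, a complete script might look like the sketch below. It assumes Python 3 and semicolon-delimited files with the zip code in the second column, as in the samples. The helper name top_zipcodes and the temporary directory are assumptions made so the example is self-contained; with real data you would pass your own list of .csv paths and use n=100.

```python
import collections
import csv
import os
import tempfile

def iter_zipcodes(paths):
    """Yield the zip code (second column) of every row in every file."""
    for path in paths:
        with open(path, newline='') as fh:
            for row in csv.reader(fh, delimiter=';'):
                yield row[1]

def top_zipcodes(paths, n=100):
    """Hypothetical helper: return the n most common zip codes with counts."""
    return collections.Counter(iter_zipcodes(paths)).most_common(n)

# Recreate the two sample files from above in a temporary directory.
tmpdir = tempfile.mkdtemp()
samples = {
    'zc1.txt': 'something;00001\nanother;00002\ntest;00003\n',
    'zc2.txt': 'foo;00001\nbar;00001\nquuz;00003\n',
}
paths = []
for name, text in samples.items():
    path = os.path.join(tmpdir, name)
    with open(path, 'w', newline='') as fh:
        fh.write(text)
    paths.append(path)

top = top_zipcodes(paths, n=2)
print(top)  # [('00001', 3), ('00003', 2)]
```

most_common(n) returns (zip code, count) pairs sorted from most to least frequent, so the result can be printed or written out directly.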


 