I'm a Python newbie looking to count the 100 most frequently occurring zip codes across several .csv files (6+). There are 3 million+ zip codes in the data set, and I'm looking for a way to pull out only the top 100. Below is a code sample inspired by another post, though I'm trying to count across several .csv files. Thanks in advance!
```python
import csv
import collections

zip_counts = collections.Counter()
for filename in ['zipcodefile1.csv', 'zipcodefile2.csv', 'zipcodefile3.csv']:
    with open(filename) as input_file:
        for row in csv.reader(input_file, delimiter=';'):
            zip_counts[row[1]] += 1
print(zip_counts.most_common(100))
```
I'd suggest using Python's generators here, as they're simple and memory-efficient. First, suppose we have two files.

`zc1.txt`:

```
something;00001
another;00002
test;00003
```

and `zc2.txt`:

```
foo;00001
bar;00001
quuz;00003
```
Now let's write a function that takes several filenames and iterates through the lines of all the files, yielding only the zip codes:

```python
import csv

def iter_zipcodes(paths):
    for path in paths:
        with open(path) as fh:
            for row in csv.reader(fh, delimiter=';'):
                yield row[1]
```
Note that we write `yield row[1]`. This signals that the function is a generator, and that it returns its values lazily.
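To see that laziness in action, here is a tiny toy generator (illustrative only, not part of the post): nothing in the body runs until a value is actually requested with `next()`.

```python
def gen():
    # Each print runs only when the next value is pulled from the generator.
    print("producing 1")
    yield 1
    print("producing 2")
    yield 2

g = gen()        # creating the generator runs none of the body yet
first = next(g)  # now "producing 1" is printed and 1 is returned
```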
We can test it out as follows:
```
>>> list(iter_zipcodes(['zc1.txt', 'zc2.txt']))
['00001', '00002', '00003', '00001', '00001', '00003']
```
So we see that the generator simply spits out the zip codes in each file, in order. Now let's count them:
```
>>> import collections
>>> zipcodes = iter_zipcodes(['zc1.txt', 'zc2.txt'])
>>> counts = collections.Counter(zipcodes)
>>> counts
Counter({'00001': 3, '00003': 2, '00002': 1})
```
Looks like it worked. This approach is efficient because it reads only one line at a time; when one file is exhausted, it moves on to the next.
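Putting the pieces together for the original question, here is a minimal end-to-end sketch. The helper name `top_zipcodes` and the `n=100` default are my own additions, not from the post:

```python
import collections
import csv

def iter_zipcodes(paths):
    # Lazily yield the zip code (second ;-delimited column) from every file.
    for path in paths:
        with open(path) as fh:
            for row in csv.reader(fh, delimiter=';'):
                yield row[1]

def top_zipcodes(paths, n=100):
    # Stream all files through a single Counter and keep the n most common.
    return collections.Counter(iter_zipcodes(paths)).most_common(n)
```

Calling `top_zipcodes(['zipcodefile1.csv', 'zipcodefile2.csv'])` returns a list of `(zipcode, count)` pairs sorted from most to least common, while holding at most one line of input in memory at a time.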