
Faster way to check for a value in a CSV?

I have some code that looks up a record in a CSV file and, if it isn't present there, retrieves it from the Google Maps API instead. I have 100,000+ records and it's taking ~2 hours. Any ideas on how to speed this up? Thanks!

from csv import DictReader
import codecs

def find_school(high_school, city, state):
    types_of_encoding = ["utf8"]
    for encoding_type in types_of_encoding:
        with codecs.open('C:/high_schools.csv', encoding=encoding_type, errors='replace') as csvfile:
            reader = DictReader(csvfile)
            for row in reader:
                # checks the csv file and sees if the high school already exists
                if (row['high_school'] == high_school.upper() and
                    row['city'] == city.upper() and
                    row['state'] == state.upper()):
                    return row['zipcode'], row['latitude'], row['longitude'], row['place_id']
                else:
                    # hits the Google Maps API
                    ...

# executes
df['zip'], df['latitude'], df['longitude'], df['place_id'] = zip(*df.apply(
    lambda row: find_school(row['high_school'].strip(), row['City'].strip(), row['State'].strip()),
    axis=1))

CSV FILE SNIPPET

high_school,city,state,address,zipcode,latitude,longitude,place_id,country,location_type
GEORGIA MILITARY COLLEGE,MILLEDGEVILLE,GA,"201 E GREENE ST, MILLEDGEVILLE, GA 31061, USA",31061,33.0789184,-83.2235169,ChIJv0wUz97H9ogRwuKm_HC-lu8,USA,UNIVERSITY
BOWIE,BOWIE,MD,"15200 ANNAPOLIS RD, BOWIE, MD 20715, USA",20715,38.9780387,-76.7435378,ChIJRWh2C1fpt4kR6XFWnAm5yAE,USA,SCHOOL
EVERGLADES,MIRAMAR,FL,"17100 SW 48TH CT, MIRAMAR, FL 33027, USA",33027,25.9696495,-80.3737813,ChIJQfmM_I6j2YgR1Hdq0CC4apo,USA,SCHOOL

There is no point re-reading the file every single time you want to make a check: with 100,000+ records, that is a full scan of the CSV per lookup (roughly O(n²) work overall), whereas a dictionary lookup is O(1) on average. Just load the file once into memory and build a dictionary keyed on a tuple of the fields you're interested in.

import csv

lookup_dict = {}
with open('C:/Users/Josh/Desktop/test.csv') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        # key on a case-normalized (school, city, state) tuple for O(1) lookups
        lookup_dict[(row['high_school'].lower(), row['city'].lower(),
                     row['state'].lower())] = row

Now you only have to check whether the (school, city, state) tuple you want to test is already a key in lookup_dict. If it's not, you query Google Maps.
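
The miss path is the only part that stays expensive, so it can also help to cache whatever Google Maps returns back into lookup_dict; duplicate schools among the 100,000+ rows then only trigger one request. A minimal sketch of that check-and-cache pattern, assuming a hypothetical query_google_maps helper that returns a row-like dict:

def lookup_or_fetch(high_school, city, state):
    """Return the row for a school, querying Google Maps only on a cache miss."""
    key = (high_school.strip().lower(), city.strip().lower(), state.strip().lower())
    if key not in lookup_dict:
        # Not in the CSV: fall back to the API and remember the result so the
        # same school is never fetched twice.
        lookup_dict[key] = query_google_maps(high_school, city, state)  # hypothetical helper
    return lookup_dict[key]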

Since your edit shows that you're applying this to a DataFrame, you should build lookup_dict outside of the function and pass it in as an argument. That way, the file is still only read once.

lookup_dict = {}
with open('C:/Users/Josh/Desktop/test.csv') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        lookup_dict[(row['high_school'].lower(), row['city'].lower(),
                    row['state'].lower())] = row

def find_school(high_school, city, state, lookup_dict):
    result = lookup_dict.get((high_school.lower(), city.lower(), state.lower()))
    if result:
        return result
    else:
        # Google query
        pass

a = find_school('georgia military college', 'milledgeville', 'ga', lookup_dict)
#df['zip'],df['latitude'], df['longitude'], df['place_id'] = zip(*df.apply(lambda row: find_school(row['high_school'].strip(), row['City'].strip(), row['State'].strip()), axis=1))
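
Note that the commented-out apply line above would need two adjustments before it works with this approach: the lambda has to pass lookup_dict through, and find_school has to return the four fields rather than the whole row (otherwise the zip(*...) unpacking fails). A hedged sketch of that wiring, assuming the question's DataFrame column names (high_school, City, State) and the same hypothetical query_google_maps fallback:

def find_school(high_school, city, state, lookup_dict):
    key = (high_school.lower(), city.lower(), state.lower())
    row = lookup_dict.get(key)
    if row is None:
        # Cache miss: fetch from the API and store it for any repeated schools.
        row = query_google_maps(high_school, city, state)  # hypothetical helper
        lookup_dict[key] = row
    # Return the four fields that get unpacked into new DataFrame columns.
    return row['zipcode'], row['latitude'], row['longitude'], row['place_id']

df['zip'], df['latitude'], df['longitude'], df['place_id'] = zip(*df.apply(
    lambda r: find_school(r['high_school'].strip(), r['City'].strip(), r['State'].strip(), lookup_dict),
    axis=1))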


 