
Adding Columns to pandas dataframe & iterating through one of the columns

I have loaded a dataframe with a number of columns, one of which contains an address. I'm using a Python geocoder module to get the lat/long for every address in this CSV.

Pandas

1) How do I add new columns? Should I add the columns as I iterate through the rows, or should I add them at the start?

2) In my code below, I am trying to iterate through every row in the data frame. For every row, I am performing the geocoder.google() method. Column 16 of my csv/data frame contains an address.

How would I refer to that address column whilst iterating through all the rows? I get "IndexError: tuple index out of range" if I run the code as it is.

CSV

3) The second part of my code does a similar thing with the csv module. I read in a CSV, loop through every row and call the geocoder method as before. The geocoder method returns a list of two values (the two coordinates, [XXXX,XXXX]). I am trying to write out each original row followed by two more columns, one for each coordinate. I am getting "TypeError: can only concatenate list (not "float") to list".

import geocoder 
import csv
import pandas as pd
import time

df = pd.read_csv("RSM100_1995.csv",header=None)
print(df.head())
for row in df.iterrows():
   g = geocoder.google(row[16])
   print(row[16],g.latlng)
   time.sleep(2)

with open("RSM100_1995.csv","r") as f, open("RSM_GCTest.csv","w",newline='') as g:
    rdr = csv.reader(f)
    wtr = csv.writer(g)
    for r in rdr:
        gc = geocoder.google(str(r[16]))
        print(r[16],gc.latlng)
        wtr.writerow(r + gc.latlng[0]+gc.latlng[1])
        time.sleep(2)

By the way, I am using time.sleep(2) since the geocoder has a limit to the number of requests. I don't run the code as it is here, just put it like this to display it.

If anyone has a better way of geocoding UK addresses using Python, let me know.


Edit:

For Chirag - I've made the changes you mentioned. I've tried replacing 'Address' in the code below with the column index (which is 16) with the same result.

I've added column headers with X.columns

I'm now getting a very long error message linking many different files.

RS1995 = pd.read_csv("RSM100_1995.csv",header=None)

RS1995.columns = ['ID','Price','Date','Postcode','X','Y','Z','PAON','SAON','Street','Locality','District','City','County','A','B','Address','XX']
print(RS1995.head())
for row in RS1995.iterrows():
    RS1995['lat'] = geocoder.google(RS1995['Address']).latlng[0]
    RS1995['lng'] = geocoder.google(RS1995['Address']).latlng[1]
    print(RS1995.head())
    time.sleep(2)

In terms of the CSV, there are 17 columns, which I've titled above. The 'Address' column is the one I want to pass through the geocoder. The Address column itself is a concatenation of 'PAON', 'SAON', 'Street', 'Locality', 'County' & 'Postcode'. I could have included 'City' too. I did all the concatenation using the csv module.

If it helps - here is the Geocoder link:

http://geocoder.readthedocs.io/


Edit 2:

RS1995 = pd.read_csv("RSM100_1995.csv",header=None)

RS1995.columns = ['ID','Price','Date','Postcode','X','Y','Z','PAON','SAON','Street','Locality','District','City','County','A','B','Address','XX']
print(RS1995.head())

RS1995['lat'] = "x"
RS1995['lng'] = "y"
print(RS1995.head())
for row in RS1995.iterrows():
    print(row)

Whenever I run the code above, I get the output below (I've taken just the last two rows as an example). What does this mean? How would I iterate through every row, geocode the address and wait 2 seconds so I don't exceed the rate limit?

(98, ID                     {40E4DAC0-863F-42FE-94B4-49A70D3BE0B9}
Price                                                   43000
Date                                         24/02/1995 00:00
Postcode                                             WS12 3XJ
X                                                           S
Y                                                           N
Z                                                           F
PAON                                                        1
SAON                                                      NaN
Street                                           WOODFORD WAY
Locality                                          HEATH HAYES
District                                              CANNOCK
City                                            CANNOCK CHASE
County                                          STAFFORDSHIRE
A                                                           A
B                                                           A
Address     1  WOODFORD WAY HEATH HAYES STAFFORDSHIRE WS12...
XX          1  WOODFORD WAY HEATH HAYES STAFFORDSHIRE WS12...
lat                                                         x
lng                                                         y
Name: 98, dtype: object)
(99, ID                  {061625F8-82D5-43CF-A55F-4288979D31EC}
Price                                                42995
Date                                      01/09/1995 00:00
Postcode                                           PO1 5AY
X                                                        T
Y                                                        N
Z                                                        F
PAON                                                    67
SAON                                                   NaN
Street                                        BYERLEY ROAD
Locality                                        PORTSMOUTH
District                                        PORTSMOUTH
City                                            PORTSMOUTH
County                                          PORTSMOUTH
A                                                        A
B                                                        A
Address     67  BYERLEY ROAD PORTSMOUTH PORTSMOUTH PO1 5AY
XX          67  BYERLEY ROAD PORTSMOUTH PORTSMOUTH PO1 5AY
lat                                                      x
lng                                                      y
Name: 99, dtype: object)

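You can create new columns in a pandas dataframe much as you would add a key to an associative array or dictionary: assign to df['name']. Note that geocoder.google() expects a single address string rather than a whole column, so call it once per address and then build the two new columns for latitude and longitude from the results:

# one lookup per address; latlng may be empty if a lookup fails
coords = [geocoder.google(address).latlng for address in df[16]]
df['lat'] = [c[0] if c else None for c in coords]
df['lng'] = [c[1] if c else None for c in coords]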
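On the row-by-row loop from the question: iterrows() yields (index, row) tuples, which is why row[16] fails with "IndexError: tuple index out of range" and why the printed output starts with (98, ...). Unpacking the tuple gives access to the row itself. Below is a minimal sketch of that pattern. The function fake_geocode and its coordinates are hypothetical stand-ins for geocoder.google(address).latlng, used so the example runs without network access, and add_lat_lng is an illustrative helper name, not part of pandas or geocoder.

```python
import time
import pandas as pd

def fake_geocode(address):
    """Hypothetical stand-in for geocoder.google(address).latlng."""
    known = {"1 WOODFORD WAY": [52.69, -1.98]}  # made-up coordinates
    return known.get(address, [])

def add_lat_lng(df, address_col, geocode, delay=0.0):
    """Return a copy of df with 'lat' and 'lng' columns added."""
    lats, lngs = [], []
    # iterrows() yields (index, row) pairs, so unpack both parts.
    for idx, row in df.iterrows():
        latlng = geocode(row[address_col])
        # A failed lookup gives an empty result; record it as missing.
        lats.append(latlng[0] if latlng else None)
        lngs.append(latlng[1] if latlng else None)
        time.sleep(delay)  # the question uses 2 seconds for rate limiting
    out = df.copy()
    out["lat"] = lats
    out["lng"] = lngs
    return out

demo = pd.DataFrame({"Address": ["1 WOODFORD WAY", "UNKNOWN PLACE"]})
result = add_lat_lng(demo, "Address", fake_geocode)
print(result)
```

With the real module, passing lambda a: geocoder.google(a).latlng as geocode and delay=2 should give the behaviour described in the question. The columns are built from lists after the loop, so the dataframe is not modified while it is being iterated.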

Then you can write the entire dataframe to a csv:

df.to_csv('RSM_GCTest.csv')
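For the csv module version, the TypeError comes from r + gc.latlng[0]: r is a list, and a list can only be concatenated with another list, not with a float. Wrapping the two coordinates in a list avoids that. The sketch below uses in-memory files and a hypothetical fake_geocode function with made-up coordinates in place of geocoder.google(...).latlng. append_coords is an illustrative helper name.

```python
import csv
import io

def fake_geocode(address):
    # Hypothetical stand-in for geocoder.google(address).latlng; made-up values.
    return [50.8, -1.07] if address == "67 BYERLEY ROAD" else []

def append_coords(rows, address_index, geocode):
    """Yield each CSV row with latitude and longitude appended."""
    for r in rows:
        latlng = geocode(r[address_index])
        # Leave both cells blank when the lookup returns nothing.
        lat, lng = latlng if latlng else ("", "")
        # r is a list, so extend it with another list, not with bare floats.
        yield r + [lat, lng]

src = io.StringIO("a,67 BYERLEY ROAD\nb,NOWHERE\n")
dst = io.StringIO()
csv.writer(dst).writerows(append_coords(csv.reader(src), 1, fake_geocode))
print(dst.getvalue())
```

In the original loop the equivalent change would be wtr.writerow(r + [gc.latlng[0], gc.latlng[1]]), with a check for an empty latlng when a lookup fails.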
