Combine two data frames without a common column

I am adding a "state" column to an existing dataframe (reviewers) that does not share a common column with my other dataframe (zipcodes). Therefore, I need to convert zipcodes into states (for example, 00704 would be PR) and load them into the new state column.

reviewers = pd.read_csv('reviewers.txt', 
                        sep='|',
                        header=None,
                        names=['user id','age','gender','occupation','zipcode'])
reviewers['state'] = ""

   user id  age gender  occupation zipcode state
0        1   24      M  technician   85711      
1        2   53      F       other   94043      


zipcodes = pd.read_csv('zipcodes.txt',
                  usecols = [1,4],
                  converters={'Zipcode':str})
      Zipcode State
0       00704    PR
1       00704    PR
2       00704    PR
3       00704    PR
4       00704    PR


zipcodes1 = zipcodes.set_index('Zipcode') ###Setting the index to zipcode
dfzip = zipcodes1
print(dfzip)


        State
Zipcode      
00704      PR
00704      PR
00704      PR



zips = (pd.Series(dfzip.values.tolist(), index = zipcodes1['State'].index))

states = []
for zipcode in reviewers['Zipcode']:
    if re.search('[a-zA-Z]+', zipcode):
        append.states['canada']
    elif zipcode in zips.index:
        append.states(zips['zipcode'])
    else:
        append.states('unkown')

I am not sure my loop is correct either. I have to classify the zipcodes as US zip codes (numeric), Canadian postal codes (alphabetic), and all other codes, which we define as unknown. Let me know if you need the data file.

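Your loop needs to be fixed: call states.append(...) rather than append.states(...), look up the loop variable zipcode rather than the string 'zipcode', and use the actual column name 'zipcode':

import re

states = []
for zipcode in reviewers['zipcode']:
    # Canadian postal codes contain letters; US zipcodes are digits only
    if re.search(r'[a-zA-Z]', zipcode):
        states.append('Canada')
    elif zipcode in zips.index:
        states.append(zips[zipcode])
    else:
        states.append('Unknown')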

Also, I am assuming you want the states list plugged back into the dataframe. In that case you don't need the for loop; you can use pandas apply on the zipcode column to get a new column:

def findState(code):
    res = 'Unknown'
    # letters mean a Canadian postal code; otherwise look up the US zipcode
    if re.search(r'[a-zA-Z]', code):
        res = 'Canada'
    elif code in zips.index:
        res = zips[code]
    return res

reviewers['State'] = reviewers['zipcode'].apply(findState)
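
One caveat for the loop and the apply version above: if zips keeps the repeated zipcodes shown in the question's sample (00704 appears several times), a lookup such as zips[code] returns a Series rather than a single state. The sketch below uses a small made-up table (the 85711 to AZ row is illustrative, not taken from the question's data file) to show the difference, and assumes each zipcode belongs to exactly one state:

```python
import pandas as pd

# Toy lookup table with a repeated zipcode, similar to the zipcodes.txt sample.
# The rows are made up for illustration.
zipcodes = pd.DataFrame({
    'Zipcode': ['00704', '00704', '85711'],
    'State': ['PR', 'PR', 'AZ'],
})

# Duplicate index labels: a label lookup returns a Series, not one string.
dup = zipcodes.set_index('Zipcode')['State']
print(type(dup['00704']))

# Unique index labels: a label lookup returns a single string.
zips = zipcodes.drop_duplicates('Zipcode').set_index('Zipcode')['State']
print(zips['00704'])
```

Dropping duplicates before set_index keeps each lookup scalar, so the resulting column holds plain state strings.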

Use:

import numpy as np

# remove duplicates and create a Series mapping Zipcode -> State
zips = zipcodes.drop_duplicates().set_index('Zipcode')['State']

# mask for Canadian postal codes (they start with a letter)
# if lowercase letters are possible, change the pattern to [a-zA-Z]+
mask = reviewers['zipcode'].str.match('[A-Z]+')
# 'canada' where the mask is True, otherwise the mapped state
reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
# zipcodes not found in zips map to NaN; replace them with 'unknown'
reviewers['state'] = reviewers['state'].fillna('unknown')
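
To try the same idea without the data files, here is a minimal self-contained sketch on made-up rows. The sample zipcodes, the 85711 to AZ pairing, and the names is_canada and mapped are assumptions for illustration only; it fills the unmatched codes before np.where, which should give the same result as filling afterwards:

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for reviewers.txt and zipcodes.txt.
reviewers = pd.DataFrame({
    'user id': [1, 2, 3, 4],
    'zipcode': ['85711', '00704', 'V3N4P', '99999'],
})
zipcodes = pd.DataFrame({
    'Zipcode': ['00704', '00704', '85711'],
    'State': ['PR', 'PR', 'AZ'],
})

# One state per zipcode, indexed by zipcode, so it can be used with map().
zips = zipcodes.drop_duplicates('Zipcode').set_index('Zipcode')['State']

# True where the code starts with a letter (treated as Canadian here).
is_canada = reviewers['zipcode'].str.match('[a-zA-Z]')

# Look up US states; codes missing from zips become 'unknown'.
mapped = reviewers['zipcode'].map(zips).fillna('unknown')

reviewers['state'] = np.where(is_canada, 'canada', mapped)
print(reviewers)
```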

Loop version with apply:

import re 

def f(code):
    res="unknown"
    # if lowercase letters are possible, change the pattern to [a-zA-Z]+
    if re.match('[A-Z]+', code):
        res='canada'
    elif code in zips.index:
        res=zips[code]
    return res

reviewers['State1'] = reviewers['zipcode'].apply(f)
print (reviewers.tail(10))  
     user id  age gender     occupation zipcode state State1
933      934   61      M       engineer   22902    VA     VA
934      935   42      M         doctor   66221    KS     KS
935      936   24      M          other   32789    FL     FL
936      937   48      M       educator   98072    WA     WA
937      938   38      F     technician   55038    MN     MN
938      939   26      F        student   33319    FL     FL
939      940   32      M  administrator   02215    MA     MA
940      941   20      M        student   97229    OR     OR
941      942   48      F      librarian   78209    TX     TX
942      943   22      M        student   77841    TX     TX

#test if same output
print ((reviewers['State1'] == reviewers['state']).all())
True

Timings:

In [56]: %%timeit
    ...: mask = reviewers['zipcode'].str.match('[A-Z]+') 
    ...: reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
    ...: reviewers['state'] = reviewers['state'].fillna('unknown')
    ...: 
100 loops, best of 3: 2.08 ms per loop

In [57]: %%timeit
    ...: reviewers['State1'] = reviewers['zipcode'].apply(f)
    ...: 
100 loops, best of 3: 17 ms per loop
