Combine two data frames without a common column

I am adding a "state" column to an existing data frame that does not share a common column with my other data frame. Therefore, I need to convert zip codes into states (for example, 00704 would be PR) and load them into the new state column.

import re
import pandas as pd

reviewers = pd.read_csv('reviewers.txt',
                        sep='|',
                        header=None,
                        names=['user id','age','gender','occupation','zipcode'])
reviewers['state'] = ""

  user id  age gender       occupation    zipcode    state
0          1   24      M     technician   85711      
1          2   53      F          other   94043      


zipcodes = pd.read_csv('zipcodes.txt',
                  usecols = [1,4],
                  converters={'Zipcode':str})
      Zipcode State
0       00704    PR
1       00704    PR
2       00704    PR
3       00704    PR
4       00704    PR


zipcodes1 = zipcodes.set_index('Zipcode') ###Setting the index to zipcode
dfzip = zipcodes1
print(dfzip)


        State
Zipcode      
00704      PR
00704      PR
00704      PR



zips = (pd.Series(dfzip.values.tolist(), index = zipcodes1['State'].index))

states = []
for zipcode in reviewers['Zipcode']:
    if re.search('[a-zA-Z]+', zipcode):
        append.states['canada']
    elif zipcode in zips.index:
        append.states(zips['zipcode'])
    else:
        append.states('unkown')

I am not sure my loop is correct either. I have to sort the zip codes into US zip codes (numeric), Canadian zip codes (alphabetic), and all other zip codes, which we define as unknown. Let me know if you need the data file.

Your loop needs to be fixed:

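# assumes zips is a Series mapping each zipcode to a single state
states = []
for zipcode in reviewers['zipcode']:
    # look for letters specifically; \w would also match digits
    if re.search(r'[a-zA-Z]', zipcode):
        states.append('Canada')
    elif zipcode in zips.index:
        states.append(zips[zipcode])
    else:
        states.append('Unknown')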
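
Two details matter in a loop like this: the pattern has to test for letters specifically (`\w` also matches digits, so it would flag every US zip code), and `list.append` adds one item while `list.extend` with a string adds one character at a time. A quick check, using made-up sample codes rather than the real data file:

```python
import re

# Sample codes chosen for illustration only
samples = ["85711", "V3N4P", "00704"]

# \w matches digits as well as letters, so every sample matches
word_hits = [bool(re.match(r"\w+", s)) for s in samples]
print(word_hits)    # [True, True, True]

# Searching for a letter singles out the alphanumeric code
letter_hits = [bool(re.search(r"[A-Za-z]", s)) for s in samples]
print(letter_hits)  # [False, True, False]

appended, extended = [], []
appended.append("Canada")  # adds one item
extended.extend("Canada")  # adds six one-character items
print(appended)  # ['Canada']
print(extended)  # ['C', 'a', 'n', 'a', 'd', 'a']
```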

Also, I am assuming you want the states list plugged back into the data frame. In that case you don't need the for loop. You can use pandas apply on the zipcode column to get a new column:

def findState(code):
    res = 'Unknown'
    # look for letters specifically; \w would also match digits
    if re.search(r'[a-zA-Z]', code):
        res = 'Canada'
    elif code in zips.index:
        res = zips[code]
    return res

reviewers['state'] = reviewers['zipcode'].apply(findState)

Use:

import numpy as np

#remove duplicates and create a Series (zipcode -> state) for mapping
zips = zipcodes.drop_duplicates().set_index('Zipcode')['State']

#get mask for Canadian zip codes (they start with a letter)
#if lowercase letters are possible, change the pattern to [a-zA-Z]+
mask = reviewers['zipcode'].str.match('[A-Z]+')
#new column: 'canada' where the mask is True, mapped state otherwise
reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
#zip codes not found in zips give NaN, replace them
reviewers['state'] = reviewers['state'].fillna('unknown')
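
To try the same steps without the data files, here is a minimal sketch on two small hand-made frames. The rows below are hypothetical stand-ins for `zipcodes.txt` and `reviewers.txt`, and the sketch assumes each zipcode belongs to exactly one state, so that `drop_duplicates()` leaves a unique index for `map` to look up:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for zipcodes.txt and reviewers.txt
zipcodes = pd.DataFrame({
    "Zipcode": ["00704", "00704", "85711", "94043"],
    "State": ["PR", "PR", "AZ", "CA"],
})
reviewers = pd.DataFrame({
    "user id": [1, 2, 3, 4],
    "zipcode": ["85711", "94043", "V3N4P", "99999"],
})

# One row per zipcode, then a Series: index = zipcode, value = state
zips = zipcodes.drop_duplicates().set_index("Zipcode")["State"]

# True where the code starts with an uppercase letter
mask = reviewers["zipcode"].str.match("[A-Z]+")

# 'canada' for masked rows, looked-up state (or NaN) for the rest
reviewers["state"] = np.where(mask, "canada", reviewers["zipcode"].map(zips))
reviewers["state"] = reviewers["state"].fillna("unknown")

result = reviewers["state"].tolist()
print(result)  # ['AZ', 'CA', 'canada', 'unknown']
```

The two frames never need a shared column name: the zipcode values themselves act as the key, with `zips` serving as the lookup table.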

Loop version with apply:

import re 

def f(code):
    res="unknown"
    #if lowercase letters are possible, change the pattern to [a-zA-Z]+
    if re.match('[A-Z]+', code):
        res='canada'
    elif code in zips.index:
        res=zips[code]
    return res

reviewers['State1'] = reviewers['zipcode'].apply(f)
print (reviewers.tail(10))  
     user id  age gender     occupation zipcode state State1
933      934   61      M       engineer   22902    VA     VA
934      935   42      M         doctor   66221    KS     KS
935      936   24      M          other   32789    FL     FL
936      937   48      M       educator   98072    WA     WA
937      938   38      F     technician   55038    MN     MN
938      939   26      F        student   33319    FL     FL
939      940   32      M  administrator   02215    MA     MA
940      941   20      M        student   97229    OR     OR
941      942   48      F      librarian   78209    TX     TX
942      943   22      M        student   77841    TX     TX

#test if same output
print ((reviewers['State1'] == reviewers['state']).all())
True

Timings:

In [56]: %%timeit
    ...: mask = reviewers['zipcode'].str.match('[A-Z]+') 
    ...: reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
    ...: reviewers['state'] = reviewers['state'].fillna('unknown')
    ...: 
100 loops, best of 3: 2.08 ms per loop

In [57]: %%timeit
    ...: reviewers['State1'] = reviewers['zipcode'].apply(f)
    ...: 
100 loops, best of 3: 17 ms per loop
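
Outside IPython, a comparison along these lines can be rerun with the standard `timeit` module. This is only a sketch on synthetic data: the four sample codes and the two-entry `zips` Series are made up, and the absolute numbers will depend on the machine and pandas version, so no expected timings are shown.

```python
import re
import timeit

import numpy as np
import pandas as pd

# Synthetic data (hypothetical): 943 rows cycling through four sample codes
zips = pd.Series({"85711": "AZ", "94043": "CA"})
codes = ["85711", "94043", "V3N4P", "99999"]
reviewers = pd.DataFrame({"zipcode": [codes[i % 4] for i in range(943)]})

def vectorized():
    # mask + map + fillna, as in the vectorized version
    mask = reviewers["zipcode"].str.match("[A-Z]+")
    out = np.where(mask, "canada", reviewers["zipcode"].map(zips))
    return pd.Series(out, index=reviewers.index).fillna("unknown")

def f(code):
    # row-by-row lookup, as in the apply version
    if re.match("[A-Z]+", code):
        return "canada"
    if code in zips.index:
        return zips[code]
    return "unknown"

def with_apply():
    return reviewers["zipcode"].apply(f)

# Both approaches should label every row identically
same = vectorized().tolist() == with_apply().tolist()
print(same)

t_vec = timeit.timeit(vectorized, number=20)
t_apply = timeit.timeit(with_apply, number=20)
print(f"vectorized: {t_vec:.4f}s, apply: {t_apply:.4f}s")
```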
