
Map the value of a new column by searching another dataframe

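I have two dataframes, df_geo and df_event, and I want to create two new columns in df_event. The dataframes look like the following, with other columns removed for simplicity: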

import pandas as pd

data_geo = [['040','01','000','00000','00000','00000','Alabama'],
            ['050','01','001','00000','00000','00000','Autauga County'],
            ['050','01','097','00000','00000','00000','Mobile County'],
            ['050','01','101','00000','00000','00000','Montgomery County'],
            ['050','01','115','00000','00000','00000','St. Clair County'],
            ['040','09','000','00000','00000','00000','Connecticut'],
            ['061','09','001','04720','00000','00000','Bethel town'],
            ['040','17','000','00000','00000','00000','Illinois'],
            ['061','17','109','05638','00000','00000','Bethel township'],
            ['050','17','163','00000','00000','00000','St. Clair County']]

df_geo = pd.DataFrame(data_geo, columns=['summary_level','state_fips','county_fips','subdivision_code_fips','place_code_fips','city_code_fips','area_name'])

df_geo.info()

RangeIndex: 43847 entries, 0 to 43846
Data columns (total 7 columns):
summary_level            43847 non-null object
state_fips               43847 non-null object
county_fips              43847 non-null object
subdivision_code_fips    43847 non-null object
place_code_fips          43847 non-null object
city_code_fips           43847 non-null object
area_name                43847 non-null object
dtypes: object(7)

data_event = [['event_id','_','Alabama'],
              ['event_id','_','Connecticut'],
              ['event_id','Autauga County','Alabama'],
              ['event_id','Fairfield County','Connecticut'],
              ['event_id','Fairbanks North Star Borough','Alaska']] 

df_event = pd.DataFrame(data_event, columns = ['event_id','county','state']) 

df_event.info()

RangeIndex: 1261 entries, 0 to 1260
Data columns (total 3 columns):
event_id                1261 non-null object
county                   999 non-null object
state                   1261 non-null object
dtypes: object(3) 

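The goal is to create a function that takes the county and state from df_event as input and creates two new columns in the same dataframe. The new columns are based on state_fips and county_fips in df_geo. An example is shown below: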

inputA fun('df_geo','Connecticut','Fairfield County'):   

resultA = ['event_id','Connecticut','Fairfield County','09','001']
                                                       ^New columns

inputB fun('df_geo','Alaska','Fairbanks North Star Borough'):   

resultB = ['event_id','Alaska','Fairbanks North Star Borough','02','090']
                                                              ^New columns

This is a problem because I also need to apply this function to a list of 1,200 (and growing) events, so it would have to work within a lambda function or something else that can map it across the entire dataframe.

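This is complicated by identical county names (for example, "St. Clair County") appearing in several states. Although their area_name values are the same, their state_fips values differ.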

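St. Clair County, Illinois has a state_fips of 17, the same as every other county in Illinois and the state itself. St. Clair County, Alabama has a state_fips of 01, the same as every other county in Alabama, and so on.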
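To make the ambiguity concrete, here is a minimal sketch using a few of the sample rows. It assumes that summary_level '040' marks state rows, as the sample data suggests; the variable names are only illustrative.

```python
import pandas as pd

# A few of the sample rows, reduced to the columns needed here.
df_geo = pd.DataFrame(
    [['040', '01', '000', 'Alabama'],
     ['050', '01', '115', 'St. Clair County'],
     ['040', '17', '000', 'Illinois'],
     ['050', '17', '163', 'St. Clair County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

# Looking up by name alone returns one row per state that has such a county.
by_name = df_geo[df_geo['area_name'] == 'St. Clair County']

# Resolve the state first (assumption: '040' rows are states),
# then restrict the county search to that state.
illinois_fips = df_geo.loc[
    (df_geo['summary_level'] == '040') & (df_geo['area_name'] == 'Illinois'),
    'state_fips'].iloc[0]
in_illinois = by_name[by_name['state_fips'] == illinois_fips]
print(in_illinois[['state_fips', 'county_fips']])
```

Any lookup therefore needs the state as part of the key, not just the county name.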

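I would like to use the same search-and-map function all the way down to city_code_fips. At that level, any search term has to match exactly, to avoid picking up "Bethel town" when I mean to look up "Bethel township". Exact input also matters because some states, such as Louisiana, call their county-level geography by another name.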
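As a small illustration of why exact matching matters here, the sketch below contrasts a substring search with an equality test on the two names mentioned above. The variable names are made up for the example.

```python
import pandas as pd

names = pd.Series(['Bethel town', 'Bethel township'])

# A substring search also hits 'Bethel township', because it contains 'Bethel town'.
substring_hits = names[names.str.contains('Bethel town', regex=False)]

# An equality test only matches the exact name.
exact_hits = names[names == 'Bethel town']

print(len(substring_hits), len(exact_hits))
```

Key-based lookups such as merge and map compare whole values, so they behave like the equality test.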

In df_event, "_" indicates that the county is unknown.

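df_event['event_id'] is a unique string. Several rows in the dataframe are nearly identical but have different IDs, indicating that the event occurred more than once. This has no effect on state_fips or county_fips.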

I know this is a multi-step process, but I appreciate any help. Thank you.

You can do this with df.merge:

In [289]: df_event['state_fips'] = df_event.merge(df_geo[['state_fips','area_name']], left_on='state', right_on='area_name', how='left')['state_fips']
In [290]: df_event['county_fips'] = df_event.merge(df_geo[['county_fips','area_name']], left_on='county', right_on='area_name', how='left')['county_fips']

In [291]: df_event
Out[291]: 
  unique_str                        county        state state_fips county_fips
0   Event Id                             _      Alabama         01         NaN
1   Event Id                             _  Connecticut         09         NaN
2   Event Id                Autauga County      Alabama         01         001
3   Event Id              Fairfield County  Connecticut         09         001
4   Event Id  Fairbanks North Star Borough       Alaska         02         090
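The merge above matches on area_name alone, so a county name shared by several states would match more than one row. Below is a minimal sketch of a state-aware variant. It assumes summary_level '040' marks state rows and '050' marks county rows, as the sample suggests; the abbreviated frames and the event IDs e1 to e4 are made up for illustration.

```python
import pandas as pd

df_geo = pd.DataFrame(
    [['040', '01', '000', 'Alabama'],
     ['050', '01', '001', 'Autauga County'],
     ['050', '01', '115', 'St. Clair County'],
     ['040', '17', '000', 'Illinois'],
     ['050', '17', '163', 'St. Clair County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

df_event = pd.DataFrame(
    [['e1', '_', 'Alabama'],
     ['e2', 'St. Clair County', 'Illinois'],
     ['e3', 'St. Clair County', 'Alabama'],
     ['e4', 'Autauga County', 'Alabama']],
    columns=['event_id', 'county', 'state'])

# Step 1: state name -> state_fips, using only state-level rows (assumed '040').
states = (df_geo.loc[df_geo['summary_level'] == '040', ['area_name', 'state_fips']]
          .rename(columns={'area_name': 'state'}))
out = df_event.merge(states, on='state', how='left', validate='many_to_one')

# Step 2: (state_fips, county name) -> county_fips, using only county-level
# rows (assumed '050'). The two-column key keeps same-named counties apart.
counties = (df_geo.loc[df_geo['summary_level'] == '050',
                       ['state_fips', 'area_name', 'county_fips']]
            .rename(columns={'area_name': 'county'}))
out = out.merge(counties, on=['state_fips', 'county'], how='left',
                validate='many_to_one')
print(out)
```

Passing validate='many_to_one' makes merge raise an error if the lookup side has duplicate keys, rather than silently adding rows. The unknown county "_" finds no match and is left as NaN.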

If the area_name column contains duplicates, remove them first with DataFrame.drop_duplicates:

df_geo = df_geo.drop_duplicates('area_name')
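One thing to check before dropping duplicates: by default drop_duplicates keeps the first occurrence of each name, so for a name shared across states only one state's row survives. A minimal sketch on made-up, abbreviated rows shows how to inspect which names are affected:

```python
import pandas as pd

df_geo = pd.DataFrame(
    [['01', '115', 'St. Clair County'],
     ['17', '163', 'St. Clair County'],
     ['01', '001', 'Autauga County']],
    columns=['state_fips', 'county_fips', 'area_name'])

# keep=False flags every row whose area_name occurs more than once.
ambiguous = df_geo[df_geo.duplicated('area_name', keep=False)]
print(ambiguous)

# The default keep='first' retains the Alabama row and drops the Illinois one.
deduped = df_geo.drop_duplicates('area_name')
print(deduped)
```

If the ambiguous frame is not empty, a lookup keyed on area_name alone cannot tell those rows apart.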

Then you can use Series.map, which works like merge but is faster, so it should be preferred:

df_event['state_fips'] = df_event['state'].map(df_geo.set_index('area_name')['state_fips'])
df_event['county_fips'] = df_event['county'].map(df_geo.set_index('area_name')['county_fips'])
print(df_event)
  unique_str                        county        state state_fips county_fips
0   Event Id                             _      Alabama         01         NaN
1   Event Id                             _  Connecticut         09         NaN
2   Event Id                Autauga County      Alabama         01         001
3   Event Id              Fairfield County  Connecticut         09         001
4   Event Id  Fairbanks North Star Borough       Alaska         02         090
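The question asks for a function that respects the state when resolving a county. One possible sketch uses plain dictionary lookups, with the county dictionary keyed on the (state_fips, name) pair. The function name add_fips, the event IDs, and the reading of summary_level ('040' for states, '050' for counties) are assumptions made for this example.

```python
import pandas as pd

def add_fips(df_event, df_geo):
    """Return a copy of df_event with state_fips and county_fips columns added."""
    # Assumption: '040' rows are states and '050' rows are counties.
    states = df_geo[df_geo['summary_level'] == '040']
    counties = df_geo[df_geo['summary_level'] == '050']

    state_to_fips = dict(zip(states['area_name'], states['state_fips']))
    # Key counties on (state_fips, name) so same-named counties stay distinct.
    county_to_fips = {
        (state_fips, name): county_fips
        for state_fips, name, county_fips in zip(
            counties['state_fips'], counties['area_name'], counties['county_fips'])
    }

    out = df_event.copy()
    out['state_fips'] = out['state'].map(state_to_fips)
    # Unknown states or counties (including "_") yield None.
    out['county_fips'] = [
        county_to_fips.get(key) for key in zip(out['state_fips'], out['county'])
    ]
    return out

df_geo = pd.DataFrame(
    [['040', '01', '000', 'Alabama'],
     ['050', '01', '115', 'St. Clair County'],
     ['040', '17', '000', 'Illinois'],
     ['050', '17', '163', 'St. Clair County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

df_event = pd.DataFrame(
    [['e1', '_', 'Alabama'],
     ['e2', 'St. Clair County', 'Illinois'],
     ['e3', 'St. Clair County', 'Alabama'],
     ['e4', 'Fairbanks North Star Borough', 'Alaska']],
    columns=['event_id', 'county', 'state'])

result = add_fips(df_event, df_geo)
print(result)
```

Because Alaska is not in this abbreviated df_geo, the last row gets no codes here; with the full table it would resolve like any other state.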
