Map values into new columns by searching another dataframe

Question

I have two dataframes: df_geo and df_event. I want to create two new columns in df_event. The dataframes resemble the following, although additional columns have been removed for the sake of simplicity:

data_geo =  [['040','01','000','00000','00000','00000','Alabama'],
             ['050','01','001','00000','00000','00000','Autauga County'],
             ['050','01','097','00000','00000','00000','Mobile County'],
             ['050','01','101','00000','00000','00000','Montgomery County'],
             ['050','01','115','00000','00000','00000','St. Clair County'],
             ['040','09','000','00000','00000','00000','Connecticut'],
             ['061','09','001','04720','00000','00000','Bethel town'],
             ['040','17','000','00000','00000','00000','Illinois'],
             ['061','17','109','05638','00000','00000','Bethel township'],
             ['050','17','163','00000','00000','00000','St. Clair County']] 

import pandas as pd

df_geo = pd.DataFrame(data_geo, columns = ['summary_level', 'state_fips','county_fips','subdivision_code_fips','place_code_fips','city_code_fips','area_name'])

df_geo.info()

RangeIndex: 43847 entries, 0 to 43846
Data columns (total 7 columns):
summary_level            43847 non-null object
state_fips               43847 non-null object
county_fips              43847 non-null object
subdivision_code_fips    43847 non-null object
place_code_fips          43847 non-null object
city_code_fips           43847 non-null object
area_name                43847 non-null object

data_event = [['event_id','_','Alabama'], 
              ['event_id','_','Connecticut'],
              ['event_id','Autauga County','Alabama'],
              ['event_id','Fairfield County','Connecticut'],
              ['event_id','Fairbanks North Star Borough','Alaska']] 

df_event = pd.DataFrame(data_event, columns = ['event_id','county','state']) 

df_event.info()

RangeIndex: 1261 entries, 0 to 1260
Data columns (total 3 columns):
event_id                1261 non-null object
county                   999 non-null object
state                   1261 non-null object
dtypes: object(3)

GOAL: create a function that takes the county and state values from df_event and uses them to create two new columns in the same dataframe. The new columns are based on the values of state_fips and county_fips in df_geo. An example would look like the following:

inputA fun('df_geo','Connecticut','Fairfield County'):   

resultA = ['event_id','Connecticut','Fairfield County','09','001']
                                                       ^New columns

inputB fun('df_geo','Alaska','Fairbanks North Star Borough'):   

resultB = ['event_id','Alaska','Fairbanks North Star Borough','02','090']
                                                              ^New columns

This is a PROBLEM because I also need to use this function on a list of 1,200 (and growing) events, so the function would have to work within a lambda function or something else that can map it across the entire dataframe.
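One possible shape for such a function is sketched below. It is only an illustration: the helper name lookup_fips and the small sample frames are hypothetical, and it assumes that summary_level '040' marks state rows and '050' marks county rows, as the sample data suggests.

```python
import pandas as pd

# Small stand-in for df_geo (only the columns needed here).
df_geo = pd.DataFrame(
    [['040', '01', '000', 'Alabama'],
     ['050', '01', '001', 'Autauga County'],
     ['050', '01', '115', 'St. Clair County'],
     ['040', '17', '000', 'Illinois'],
     ['050', '17', '163', 'St. Clair County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

df_event = pd.DataFrame(
    [['e1', '_', 'Alabama'],
     ['e2', 'St. Clair County', 'Illinois'],
     ['e3', 'St. Clair County', 'Alabama']],
    columns=['event_id', 'county', 'state'])

# Assumption: summary_level '040' rows are states, '050' rows are counties.
states = df_geo[df_geo['summary_level'] == '040']
counties = df_geo[df_geo['summary_level'] == '050']

# state name -> state_fips, and (state_fips, county name) -> county_fips
state_to_fips = dict(zip(states['area_name'], states['state_fips']))
county_to_fips = dict(zip(zip(counties['state_fips'], counties['area_name']),
                          counties['county_fips']))

def lookup_fips(state, county):
    """Hypothetical helper: return (state_fips, county_fips); None if unknown."""
    state_fips = state_to_fips.get(state)
    county_fips = county_to_fips.get((state_fips, county))
    return state_fips, county_fips

# Apply the function to every row; result_type='expand' turns the returned
# tuple into two columns (labelled 0 and 1).
fips = df_event.apply(lambda row: lookup_fips(row['state'], row['county']),
                      axis=1, result_type='expand')
df_event['state_fips'] = fips[0]
df_event['county_fips'] = fips[1]
```

Row-wise apply is easy to read but runs Python code for every row. For about 1,200 events that is unlikely to matter; a merge or map on the whole column is usually the faster route for larger tables.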

This is complicated by identical county names like "St. Clair County" that appear in several states. Although their area_names are identical, the value of state_fips will be different.

The state_fips of St. Clair County, Illinois is 17, the same as all other counties in Illinois and the state itself. The state_fips of St. Clair County, Alabama is 01, the same as all other counties in Alabama, and so on.
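Because state_fips separates the two St. Clair counties, one way to keep them apart is to match on the pair (state_fips, county name) rather than on the county name alone. The sketch below does this with two left merges. The sample frames are illustrative, and the reading of summary_level ('040' for states, '050' for counties) is an assumption based on the sample rows.

```python
import pandas as pd

df_geo = pd.DataFrame(
    [['040', '01', '000', 'Alabama'],
     ['050', '01', '115', 'St. Clair County'],
     ['040', '17', '000', 'Illinois'],
     ['050', '17', '163', 'St. Clair County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

df_event = pd.DataFrame(
    [['e1', 'St. Clair County', 'Illinois'],
     ['e2', 'St. Clair County', 'Alabama'],
     ['e3', '_', 'Illinois']],
    columns=['event_id', 'county', 'state'])

# Assumption: '040' rows are states and '050' rows are counties.
states = (df_geo.loc[df_geo['summary_level'] == '040',
                     ['state_fips', 'area_name']]
          .rename(columns={'area_name': 'state'}))
counties = (df_geo.loc[df_geo['summary_level'] == '050',
                       ['state_fips', 'county_fips', 'area_name']]
            .rename(columns={'area_name': 'county'}))

# Step 1: state name -> state_fips.
out = df_event.merge(states, on='state', how='left')

# Step 2: (state_fips, county name) -> county_fips, so identical county
# names in different states cannot collide.
out = out.merge(counties, on=['state_fips', 'county'], how='left')
```

Events whose county is '_' simply find no match in step 2 and end up with a missing county_fips, while still receiving their state_fips from step 1.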

I would like to use the same search and map function all the way down to city_code_fips. At that level any search terms have to match exactly, to avoid picking up "Bethel town" when I intend to find "Bethel township". Exact inputs are also important because some states, like Louisiana, call their county-level geographies by another name.
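The difference between a loose and an exact match can be seen on the two "Bethel" rows from the sample data. This is a small sketch with an illustrative frame named subdivisions, not part of the original code.

```python
import pandas as pd

subdivisions = pd.DataFrame(
    [['09', '001', '04720', 'Bethel town'],
     ['17', '109', '05638', 'Bethel township']],
    columns=['state_fips', 'county_fips', 'subdivision_code_fips', 'area_name'])

# Substring matching is too loose: "Bethel town" is also the start of
# "Bethel township", so both rows are returned.
loose = subdivisions[
    subdivisions['area_name'].str.contains('Bethel town', regex=False)]

# Equality only accepts the exact name, so each search hits one row.
exact_town = subdivisions[subdivisions['area_name'] == 'Bethel town']
exact_township = subdivisions[subdivisions['area_name'] == 'Bethel township']
```

Merges and Series.map compare keys by equality, so they behave like the exact case here.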

In df_event a '_' indicates that the county is unknown.

df_event['event_id'] is a unique string. There are rows in the dataframe that are nearly identical but have different ids, indicating that an event has occurred on multiple occasions. This has no effect on state_fips or county_fips.

I know this is a multi-step process, but all help is appreciated. Thank you.

Answer 1

You can do this using df.merge:

In [289]: df_event['state_fips'] = df_event.merge(df_geo[['state_fips','area_name']], left_on='state', right_on='area_name', how='left')['state_fips']
In [290]: df_event['county_fips'] = df_event.merge(df_geo[['county_fips','area_name']], left_on='county', right_on='area_name', how='left')['county_fips']

In [291]: df_event
Out[291]: 
  unique_str                        county        state state_fips county_fips
0   Event Id                             _      Alabama         01         NaN
1   Event Id                             _  Connecticut         09         NaN
2   Event Id                Autauga County      Alabama         01         001
3   Event Id              Fairfield County  Connecticut         09         001
4   Event Id  Fairbanks North Star Borough       Alaska         02         090
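One caveat when matching on area_name alone: if a name occurs more than once in the lookup frame, as "St. Clair County" does in the question, a left merge returns one row per match, and a column taken from that result may no longer line up with df_event. The sketch below, with illustrative sample frames, shows the effect and uses the validate argument of DataFrame.merge to surface it early.

```python
import pandas as pd

df_geo = pd.DataFrame(
    [['01', '115', 'St. Clair County'],
     ['17', '163', 'St. Clair County']],
    columns=['state_fips', 'county_fips', 'area_name'])

df_event = pd.DataFrame(
    [['e1', 'St. Clair County', 'Illinois']],
    columns=['event_id', 'county', 'state'])

# Without a check, the single event row is matched twice.
merged = df_event.merge(df_geo[['county_fips', 'area_name']],
                        left_on='county', right_on='area_name', how='left')
n_rows = len(merged)

# validate='many_to_one' asks pandas to raise a MergeError when the keys
# on the right-hand side are not unique.
try:
    df_event.merge(df_geo[['county_fips', 'area_name']],
                   left_on='county', right_on='area_name',
                   how='left', validate='many_to_one')
    duplicate_keys = False
except pd.errors.MergeError:
    duplicate_keys = True
```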

Answer 2

If there are duplicates in the area_name column, first remove them with DataFrame.drop_duplicates:

df_geo = df_geo.drop_duplicates('area_name')

Then use Series.map, which is faster than merge and so should be preferable:

df_event['state_fips'] = df_event['state'].map(df_geo.set_index('area_name')['state_fips'])
df_event['county_fips'] = df_event['county'].map(df_geo.set_index('area_name')['county_fips'])
print (df_event)
  unique_str                        county        state state_fips county_fips
0   Event Id                             _      Alabama         01         NaN
1   Event Id                             _  Connecticut         09         NaN
2   Event Id                Autauga County      Alabama         01         001
3   Event Id              Fairfield County  Connecticut         09         001
4   Event Id  Fairbanks North Star Borough       Alaska         02         090
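Note that drop_duplicates keeps the first occurrence by default, so for a name shared across states only one row survives. The sketch below illustrates this on the two St. Clair rows and shows, as one possible alternative, a lookup Series indexed by both state_fips and area_name. The variable names are illustrative.

```python
import pandas as pd

df_geo = pd.DataFrame(
    [['050', '01', '115', 'St. Clair County'],
     ['050', '17', '163', 'St. Clair County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

# Only the first "St. Clair County" (Alabama) is kept, so a name-only
# lookup returns Alabama's code regardless of the event's state.
deduped = df_geo.drop_duplicates('area_name')
by_name = deduped.set_index('area_name')['county_fips']
name_only = by_name['St. Clair County']

# Indexing by (state_fips, area_name) keeps both rows distinguishable.
by_state_and_name = df_geo.set_index(['state_fips', 'area_name'])['county_fips']
illinois = by_state_and_name.loc[('17', 'St. Clair County')]
alabama = by_state_and_name.loc[('01', 'St. Clair County')]
```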


Answer 1
1 2020-05-29 05:42:03

Answer 2
0 2020-05-29 05:37:07
