Map the value of a new column by searching another dataframe
I have two dataframes: df_geo and df_event. I want to create two new columns in df_event. The dataframes look like the following, with other columns removed for simplicity:
import pandas as pd

# geographic reference data (other columns removed for simplicity)
data_geo = [['040','01','000','00000','00000','00000','Alabama'],
            ['050','01','001','00000','00000','00000','Autauga County'],
            ['050','01','097','00000','00000','00000','Mobile County'],
            ['050','01','101','00000','00000','00000','Montgomery County'],
            ['050','01','115','00000','00000','00000','St. Clair County'],
            ['040','09','000','00000','00000','00000','Connecticut'],
            ['061','09','001','04720','00000','00000','Bethel town'],
            ['040','17','000','00000','00000','00000','Illinois'],
            ['061','17','109','05638','00000','00000','Bethel township'],
            ['050','17','163','00000','00000','00000','St. Clair County']]
df_geo = pd.DataFrame(data_geo, columns=['summary_level', 'state_fips', 'county_fips',
                                         'subdivision_code_fips', 'place_code_fips',
                                         'city_code_fips', 'area_name'])
df_geo.info()
RangeIndex: 43847 entries, 0 to 43846
Data columns (total 7 columns):
summary_level 43847 non-null object
state_fips 43847 non-null object
county_fips 43847 non-null object
subdivision_code_fips 43847 non-null object
place_code_fips 43847 non-null object
city_code_fips 43847 non-null object
area_name 43847 non-null object
data_event = [['event_id','_','Alabama'],
              ['event_id','_','Connecticut'],
              ['event_id','Autauga County','Alabama'],
              ['event_id','Fairfield County','Connecticut'],
              ['event_id','Fairbanks North Star Borough','Alaska']]
df_event = pd.DataFrame(data_event, columns=['event_id', 'county', 'state'])
df_event.info()
RangeIndex: 1261 entries, 0 to 1260
Data columns (total 3 columns):
event_id 1261 non-null object
county 999 non-null object
state 1261 non-null object
dtypes: object(3)
The goal is to create a function that takes the county and state values from df_event and creates two new columns in that same dataframe. The new columns hold the state_fips and county_fips values looked up in df_geo. An example of what I mean:
inputA: fun(df_geo, 'Connecticut', 'Fairfield County')
resultA = ['event_id', 'Connecticut', 'Fairfield County', '09', '001']   # last two values are the new columns
inputB: fun(df_geo, 'Alaska', 'Fairbanks North Star Borough')
resultB = ['event_id', 'Alaska', 'Fairbanks North Star Borough', '02', '090']   # last two values are the new columns
This is a problem because I also need to use this function on a list of 1,200 (and growing) events; it would have to work inside a lambda, or something else that can map it across the entire dataframe.
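Roughly, the sort of row-by-row lookup I have in mind is sketched below; lookup_fips is just a placeholder name, and the apply/lambda wiring is only there to show the shape of the problem, not a finished solution:

def lookup_fips(df_geo, state, county):
    # state rows appear to use summary_level '040' in the sample data
    state_row = df_geo[(df_geo['summary_level'] == '040') &
                       (df_geo['area_name'] == state)]
    state_fips = state_row['state_fips'].iloc[0] if not state_row.empty else None

    # county rows appear to use summary_level '050'; filtering on state_fips too
    # keeps the county match inside the right state
    county_row = df_geo[(df_geo['summary_level'] == '050') &
                        (df_geo['state_fips'] == state_fips) &
                        (df_geo['area_name'] == county)]
    county_fips = county_row['county_fips'].iloc[0] if not county_row.empty else None
    return pd.Series({'state_fips': state_fips, 'county_fips': county_fips})

# applied row by row (slow on a large frame, but it shows the intended result)
df_event[['state_fips', 'county_fips']] = df_event.apply(
    lambda row: lookup_fips(df_geo, row['state'], row['county']), axis=1)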
This is complicated by the same county name appearing in several states (for example, "St. Clair County"). Although their area_name values are identical, their state_fips values differ.
St. Clair County, Illinois has a state_fips of 17, the same as every other county in Illinois and as the state itself. St. Clair County, Alabama has a state_fips of 01, the same as every other county in Alabama, and so on...
I would like to use the same search-and-map function all the way down to city_code_fips. At that level, any search term has to match exactly, to avoid selecting "Bethel town" when I mean to look up "Bethel township". Exact input also matters because some states, such as Louisiana, call their county-level geography by another name.
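To illustrate with the sample df_geo above, an exact equality test is what I mean; a substring match would also pick up "Bethel town" (the str.contains line is only there to show the unwanted behaviour):

# exact match: only rows whose area_name is exactly 'Bethel township'
exact = df_geo[df_geo['area_name'] == 'Bethel township']

# substring match: also catches 'Bethel town', which is not what I want
loose = df_geo[df_geo['area_name'].str.contains('Bethel town', regex=False)]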
In df_event, "_" means the county is unknown.
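One option I am considering (an assumption on my part, not something the data requires) is to normalise that placeholder to a proper missing value before doing any lookups:

import numpy as np

# treat the '_' placeholder as missing so unmatched lookups simply stay NaN
df_event['county'] = df_event['county'].replace('_', np.nan)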
df_event['event_id'] is a unique string. Several rows in the dataframe are nearly identical but have different IDs, indicating that an event has occurred more than once. This has no effect on state_fips or county_fips.
I know this is a multi-step process, but all help is appreciated. Thank you.
You can do this with df.merge:
In [289]: df_event['state_fips'] = df_event.merge(df_geo[['state_fips','area_name']], left_on='state', right_on='area_name', how='left')['state_fips']
In [290]: df_event['county_fips'] = df_event.merge(df_geo[['county_fips','area_name']], left_on='county', right_on='area_name', how='left')['county_fips']
In [291]: df_event
Out[291]:
   event_id                        county        state state_fips county_fips
0  event_id                             _      Alabama         01         NaN
1  event_id                             _  Connecticut         09         NaN
2  event_id                Autauga County      Alabama         01         001
3  event_id              Fairfield County  Connecticut         09         001
4  event_id  Fairbanks North Star Borough       Alaska         02         090
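One caveat: the county merge above matches on the county name alone, so counties that share a name across states (the "St. Clair County" case from the question) cannot be told apart. A possible workaround, sketched below starting from the original df_event and assuming that summary_level '040' marks state rows and '050' marks county rows in df_geo, is to key the county merge on the state as well:

# state name -> state_fips, using only the state-level rows
states = df_geo.loc[df_geo['summary_level'] == '040', ['state_fips', 'area_name']]
df_event['state_fips'] = df_event['state'].map(states.set_index('area_name')['state_fips'])

# county name -> county_fips, keyed on (state_fips, county name) so that a
# duplicated name like 'St. Clair County' resolves to the correct state
counties = df_geo.loc[df_geo['summary_level'] == '050',
                      ['state_fips', 'county_fips', 'area_name']]
df_event = df_event.merge(counties,
                          left_on=['state_fips', 'county'],
                          right_on=['state_fips', 'area_name'],
                          how='left').drop(columns='area_name')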
If there are duplicates in the area_name column, first remove them with DataFrame.drop_duplicates:
df_geo = df_geo.drop_duplicates('area_name')
Then use Series.map, which is faster than merge and so should be preferable:
df_event['state_fips'] = df_event['state'].map(df_geo.set_index('area_name')['state_fips'])
df_event['county_fips'] = df_event['county'].map(df_geo.set_index('area_name')['county_fips'])
print (df_event)
   event_id                        county        state state_fips county_fips
0  event_id                             _      Alabama         01         NaN
1  event_id                             _  Connecticut         09         NaN
2  event_id                Autauga County      Alabama         01         001
3  event_id              Fairfield County  Connecticut         09         001
4  event_id  Fairbanks North Star Borough       Alaska         02         090
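The same caveat applies here: drop_duplicates('area_name') keeps only one of the duplicated counties, so if the duplicates are genuinely different counties (as with "St. Clair County"), a keyed lookup is an alternative. A sketch, assuming state_fips has already been filled in as above and that summary_level '050' marks the county rows:

# lookup keyed on (state_fips, county name) instead of the county name alone
counties = df_geo[df_geo['summary_level'] == '050']
county_lookup = counties.set_index(['state_fips', 'area_name'])['county_fips'].to_dict()

df_event['county_fips'] = [county_lookup.get((s, c))
                           for s, c in zip(df_event['state_fips'], df_event['county'])]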