Map the value of a new column by searching another dataframe
I have two dataframes: df_geo and df_event. I want to create two new columns in df_event. The data frames resemble the following, although additional columns have been removed for the sake of simplicity:
import pandas as pd

data_geo = [['040','01','000','00000','00000','00000','Alabama'],
['050','01','001','00000','00000','00000','Autauga County'],
['050','01','097','00000','00000','00000','Mobile County'],
['050','01','101','00000','00000','00000','Montgomery County'],
['050','01','115','00000','00000','00000','St. Clair County'],
['040','09','000','00000','00000','00000','Connecticut'],
['061','09','001','04720','00000','00000','Bethel town'],
['040','17','000','00000','00000','00000','Illinois'],
['061','17','109','05638','00000','00000','Bethel township'],
['050','17','163','00000','00000','00000','St. Clair County']]
# Note: the prose calls this frame df_geo; the code (here and in the answers) names it dfgeo.
dfgeo = pd.DataFrame(data_geo, columns = ['summary_level', 'state_fips','county_fips','subdivision_code_fips','place_code_fips','city_code_fips','area_name'])
dfgeo.info()
RangeIndex: 43847 entries, 0 to 43846
Data columns (total 7 columns):
summary_level 43847 non-null object
state_fips 43847 non-null object
county_fips 43847 non-null object
subdivision_code_fips 43847 non-null object
place_code_fips 43847 non-null object
city_code_fips 43847 non-null object
area_name 43847 non-null object
data_event = [['event_id','_','Alabama'],
['event_id','_','Connecticut'],
['event_id','Autauga County','Alabama'],
['event_id','Fairfield County','Connecticut'],
['event_id','Fairbanks North Star Borough','Alaska']]
df_event = pd.DataFrame(data_event, columns = ['event_id','county','state'])
df_event.info()
RangeIndex: 1261 entries, 0 to 1260
Data columns (total 3 columns):
event_id 1261 non-null object
county 999 non-null object
state 1261 non-null object
dtypes: object(3)
The GOAL is to create a function that takes the county and state inputs from df_event and creates two new columns in the same dataframe. The new columns are based on the values of state_fips and county_fips in df_geo. An example of this would look like the following:
inputA: fun(df_geo, 'Connecticut', 'Fairfield County')
resultA = ['event_id','Connecticut','Fairfield County','09','001']
# the last two values ('09', '001') are the new columns
inputB: fun(df_geo, 'Alaska', 'Fairbanks North Star Borough')
resultB = ['event_id','Alaska','Fairbanks North Star Borough','02','090']
# the last two values ('02', '090') are the new columns
This is a PROBLEM because I also need to apply this function to a list of 1,200 (and growing) events, so it has to work within a lambda function or something else that can map it across the entire dataframe.
This is complicated by identical county names like "St. Clair County" that appear in several states. Although their area_name values are identical, the value of state_fips will be different.
The state_fips of St. Clair County, Illinois is 17, the same as all other counties in Illinois and the state itself. The state_fips of St. Clair County, Alabama is 01, the same as all other counties in Alabama, and so on.
I would like to use the same search-and-map function all the way down to city_code_fips. At that level any search terms have to match exactly, to avoid picking up "Bethel town" when I intend to find "Bethel township". Exact inputs are also important because some states, like Louisiana, call their county-level geographies by another name.
In df_event a '_' indicates that the county is unknown.
df_event['event_id'] is a unique string. There are rows in the dataframe that are nearly identical but have different ids, indicating that an event has occurred on multiple occasions. This has no effect on state_fips or county_fips.
I know this is a multi-step process, but all help is appreciated. Thank you.
You can do this using df.merge:
# Look up state_fips by matching the state name against area_name.
df_event['state_fips'] = df_event.merge(dfgeo[['state_fips','area_name']], left_on='state', right_on='area_name', how='left')['state_fips']
# Look up county_fips by matching the county name against area_name.
df_event['county_fips'] = df_event.merge(dfgeo[['county_fips','area_name']], left_on='county', right_on='area_name', how='left')['county_fips']
print(df_event)
   unique_str                        county        state  state_fips  county_fips
0    Event Id                             _      Alabama          01          NaN
1    Event Id                             _  Connecticut          09          NaN
2    Event Id                Autauga County      Alabama          01          001
3    Event Id              Fairfield County  Connecticut          09          001
4    Event Id  Fairbanks North Star Borough       Alaska          02          090
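The question points out that county names such as "St. Clair County" repeat across states, so a lookup on area_name alone cannot tell them apart. As a rough sketch of one way to handle that, the merge can be done in two steps: state name to state_fips first, then the pair (state_fips, county name) to county_fips. This assumes that summary_level '040' marks state rows and '050' marks county rows, which is inferred from the sample data rather than stated in the question; the trimmed frames and the names states, counties and result are illustrative only.

```python
import pandas as pd

# Trimmed sample based on the question's data (only the columns needed here).
dfgeo = pd.DataFrame(
    [
        ["040", "01", "000", "Alabama"],
        ["050", "01", "001", "Autauga County"],
        ["050", "01", "115", "St. Clair County"],
        ["040", "17", "000", "Illinois"],
        ["050", "17", "163", "St. Clair County"],
    ],
    columns=["summary_level", "state_fips", "county_fips", "area_name"],
)
df_event = pd.DataFrame(
    [
        ["e1", "_", "Alabama"],
        ["e2", "Autauga County", "Alabama"],
        ["e3", "St. Clair County", "Illinois"],
    ],
    columns=["event_id", "county", "state"],
)

# Assumption: '040' rows are states and '050' rows are counties.
states = dfgeo.loc[dfgeo["summary_level"] == "040", ["area_name", "state_fips"]]
states = states.rename(columns={"area_name": "state"})
counties = dfgeo.loc[
    dfgeo["summary_level"] == "050", ["state_fips", "area_name", "county_fips"]
]
counties = counties.rename(columns={"area_name": "county"})

# Step 1: state name -> state_fips.
result = df_event.merge(states, on="state", how="left", validate="many_to_one")
# Step 2: (state_fips, county name) -> county_fips, so identical county
# names in different states resolve to different codes.
result = result.merge(
    counties, on=["state_fips", "county"], how="left", validate="many_to_one"
)
print(result)
```

Passing validate="many_to_one" makes pandas raise an error if the lookup side contains duplicate keys, instead of silently adding rows. Rows whose county is '_' find no match and are left as NaN in county_fips.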
If there are duplicates in the area_name column, first remove them with DataFrame.drop_duplicates:
dfgeo = dfgeo.drop_duplicates('area_name')
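One caveat worth checking before dropping duplicates on area_name alone: for names that legitimately repeat across states, only the first row is kept. The small sketch below uses the two "St. Clair County" rows from the question; the variable names by_name and by_state_and_name are made up for illustration.

```python
import pandas as pd

# The two "St. Clair County" rows from the question's sample data.
dfgeo = pd.DataFrame(
    [
        ["01", "115", "St. Clair County"],
        ["17", "163", "St. Clair County"],
    ],
    columns=["state_fips", "county_fips", "area_name"],
)

# Deduplicating on area_name alone keeps only the first row (Alabama).
by_name = dfgeo.drop_duplicates("area_name")

# Deduplicating on the (state_fips, area_name) pair keeps both counties.
by_state_and_name = dfgeo.drop_duplicates(["state_fips", "area_name"])

print(len(by_name), len(by_state_and_name))
```

If same-named counties in different states matter for the lookup, deduplicating on the pair is likely the safer choice, at the cost of needing a two-part key afterwards.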
And then use Series.map, which is faster than merge, so it should be preferable:
df_event['state_fips'] = df_event['state'].map(dfgeo.set_index('area_name')['state_fips'])
df_event['county_fips'] = df_event['county'].map(dfgeo.set_index('area_name')['county_fips'])
print(df_event)
   unique_str                        county        state  state_fips  county_fips
0    Event Id                             _      Alabama          01          NaN
1    Event Id                             _  Connecticut          09          NaN
2    Event Id                Autauga County      Alabama          01          001
3    Event Id              Fairfield County  Connecticut          09          001
4    Event Id  Fairbanks North Star Borough       Alaska          02          090
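The map approach can also be made state-aware. The following is only a sketch under the same assumption as before, namely that summary_level '040' identifies state rows (inferred from the sample data). It maps the state name first, then looks up county_fips in a plain dictionary keyed on (state_fips, area_name) pairs; county_lookup and state_rows are illustrative names.

```python
import pandas as pd

# Trimmed sample based on the question's data.
dfgeo = pd.DataFrame(
    [
        ["040", "01", "000", "Alabama"],
        ["050", "01", "115", "St. Clair County"],
        ["040", "17", "000", "Illinois"],
        ["050", "17", "163", "St. Clair County"],
    ],
    columns=["summary_level", "state_fips", "county_fips", "area_name"],
)
df_event = pd.DataFrame(
    [
        ["e1", "St. Clair County", "Alabama"],
        ["e2", "St. Clair County", "Illinois"],
        ["e3", "_", "Illinois"],
    ],
    columns=["event_id", "county", "state"],
)

# Assumption: '040' rows are states, so their area_name values are unique.
state_rows = dfgeo[dfgeo["summary_level"] == "040"]
df_event["state_fips"] = df_event["state"].map(
    state_rows.set_index("area_name")["state_fips"]
)

# Dictionary keyed on (state_fips, area_name), so the same county name in
# two states maps to two different codes.
county_lookup = {
    (state, name): county
    for state, name, county in zip(
        dfgeo["state_fips"], dfgeo["area_name"], dfgeo["county_fips"]
    )
}
# Keys that are not found (for example county '_') yield a missing value.
df_event["county_fips"] = [
    county_lookup.get(key)
    for key in zip(df_event["state_fips"], df_event["county"])
]
print(df_event)
```

Because the lookup requires an exact match on both parts of the key, "Bethel town" and "Bethel township" would stay distinct as well.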