Map the value of a new column by searching another dataframe

I have two dataframes, df_geo and df_event, and I want to create two new columns in df_event. The dataframes resemble the following, although additional columns have been removed for the sake of simplicity:

import pandas as pd

data_geo =  [['040','01','000','00000','00000','00000','Alabama'],
             ['050','01','001','00000','00000','00000','Autauga County'],
             ['050','01','097','00000','00000','00000','Mobile County'],
             ['050','01','101','00000','00000','00000','Montgomery County'],
             ['050','01','115','00000','00000','00000','St. Clair County'],
             ['040','09','000','00000','00000','00000','Connecticut'],
             ['061','09','001','04720','00000','00000','Bethel town'],
             ['040','17','000','00000','00000','00000','Illinois'],
             ['061','17','109','05638','00000','00000','Bethel township'],
             ['050','17','163','00000','00000','00000','St. Clair County']] 

dfgeo = pd.DataFrame(data_geo, columns = ['summary_level', 'state_fips','county_fips','subdivision_code_fips','place_code_fips','city_code_fips','area_name']) 

df_geo.info()

RangeIndex: 43847 entries, 0 to 43846
Data columns (total 7 columns):
summary_level            43847 non-null object
state_fips               43847 non-null object
county_fips              43847 non-null object
subdivision_code_fips    43847 non-null object
place_code_fips          43847 non-null object
city_code_fips           43847 non-null object
area_name                43847 non-null object

data_event = [['event_id','_','Alabama'], 
              ['event_id','_','Connecticut'],
              ['event_id','Autauga County','Alabama'],
              ['event_id','Fairfield County','Connecticut'],
              ['event_id','Fairbanks North Star Borough','Alaska']] 

df_event = pd.DataFrame(data_event, columns = ['event_id','county','state']) 

df_event.info()

RangeIndex: 1261 entries, 0 to 1260
Data columns (total 3 columns):
event_id                1261 non-null object
county                   999 non-null object
state                   1261 non-null object
dtypes: object(3) 

Goal: create a function that takes the county and state values from df_event and uses them to create two new columns in the same dataframe. The new columns are based on the values of state_fips and county_fips in df_geo. An example would look like the following:

inputA fun('df_geo','Connecticut','Fairfield County'):   

resultA = ['event_id','Connecticut','Fairfield County','09','001']
                                                       ^New columns

inputB fun('df_geo','Alaska','Fairbanks North Star Borough'):   

resultB = ['event_id','Alaska','Fairbanks North Star Borough','02','090']
                                                              ^New columns

This is a problem because I also need to use this function on a list of 1,200 (and growing) events, so the function would have to work within a lambda function or something else that can map it across the entire dataframe.

This is complicated by identical county names like "St. Clair County" that appear in several states. Although their area_name values are identical, the value of state_fips will be different.

The state_fips of St. Clair County, Illinois is 17, the same as all other counties in Illinois and the state itself. The state_fips of St. Clair County, Alabama is 01, the same as all other counties in Alabama, and so on.
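One way to keep the two St. Clair counties apart is to resolve the state first and then match the county on the pair (state_fips, county name) rather than on the name alone. The sketch below is only an illustration on a cut-down sample. It assumes, from the sample rows above, that summary_level '040' marks state rows and '050' marks county rows; the names `states`, `counties` and `result` are made up for this example.

```python
import pandas as pd

# Cut-down sample in the shape of df_geo (only the columns needed here).
df_geo = pd.DataFrame(
    [
        ["040", "01", "000", "Alabama"],
        ["050", "01", "001", "Autauga County"],
        ["050", "01", "115", "St. Clair County"],
        ["040", "17", "000", "Illinois"],
        ["050", "17", "163", "St. Clair County"],
    ],
    columns=["summary_level", "state_fips", "county_fips", "area_name"],
)

df_event = pd.DataFrame(
    [
        ["e1", "_", "Alabama"],
        ["e2", "St. Clair County", "Alabama"],
        ["e3", "St. Clair County", "Illinois"],
    ],
    columns=["event_id", "county", "state"],
)

# Step 1: state name -> state_fips, using only the state-level rows
# (assumed to be summary_level '040').
states = df_geo.loc[df_geo["summary_level"] == "040", ["area_name", "state_fips"]]
states = states.rename(columns={"area_name": "state"})

# Step 2: (state_fips, county name) -> county_fips, using only the county rows
# (assumed to be summary_level '050').
counties = df_geo.loc[
    df_geo["summary_level"] == "050", ["state_fips", "area_name", "county_fips"]
].rename(columns={"area_name": "county"})

# Left merges keep every event; unmatched events get NaN in the new columns.
result = df_event.merge(states, on="state", how="left")
result = result.merge(counties, on=["state_fips", "county"], how="left")
print(result)
```

Because the second merge uses both keys, the Alabama event picks up county_fips 115 and the Illinois event picks up 163, while the event with an unknown county ('_') keeps a missing county_fips.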

I would like to use the same search-and-map function all the way down to city_code_fips. At that level, search terms have to match exactly, to avoid picking up "Bethel town" when I intend to find "Bethel township". Exact inputs are also important because some states, like Louisiana, call their county-level geographies by another name.
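To illustrate why exact matching matters here, the following small sketch compares substring matching with plain equality on a few area names. The variable names are hypothetical and the data is just three names taken from the sample above.

```python
import pandas as pd

names = pd.Series(["Bethel town", "Bethel township", "St. Clair County"])

# Substring matching: "Bethel town" is contained in "Bethel township",
# so both rows are returned.
substring_hits = names[names.str.contains("Bethel town", regex=False)]

# Exact equality returns only the row that was asked for.
exact_hits = names[names == "Bethel town"]

print(substring_hits.tolist())
print(exact_hits.tolist())
```

Lookups built on equality (merge keys, `Series.map`, dictionary keys) behave like the second case, so they should not confuse the two places as long as the input strings are spelled exactly as in df_geo.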

In df_event, a '_' indicates that the county is unknown.

df_event['event_id'] is a unique string. There are rows in the dataframe that are nearly identical but have different ids, indicating that an event has occurred on multiple occasions. This has no effect on state_fips or county_fips.

I know this is a multi-step process, but all help is appreciated. Thank you.

You can do this using df.merge:

In [289]: df_event['state_fips'] = df_event.merge(dfgeo[['state_fips','area_name']], left_on='state', right_on='area_name', how='left')['state_fips']    
In [290]: df_event['county_fips'] = df_event.merge(dfgeo[['county_fips','area_name']], left_on='county', right_on='area_name', how='left')['county_fips']

In [291]: df_event
Out[291]: 
  unique_str                        county        state state_fips county_fips
0   Event Id                             _      Alabama         01         NaN
1   Event Id                             _  Connecticut         09         NaN
2   Event Id                Autauga County      Alabama         01         001
3   Event Id              Fairfield County  Connecticut         09         001
4   Event Id  Fairbanks North Star Borough       Alaska         02         090
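One caveat with merging on area_name alone: if a name occurs more than once in the geography table, a left merge returns one row per match, so the result has more rows than the event table and no longer lines up with it when assigned back as a column. The sketch below shows this on a tiny made-up sample, and shows how the `validate="m:1"` option of `merge` can be used to detect the situation. The frames `geo` and `events` are illustrative only.

```python
import pandas as pd
from pandas.errors import MergeError

geo = pd.DataFrame(
    {
        "area_name": ["St. Clair County", "St. Clair County", "Autauga County"],
        "state_fips": ["01", "17", "01"],
        "county_fips": ["115", "163", "001"],
    }
)
events = pd.DataFrame(
    {"event_id": ["e1", "e2"], "county": ["St. Clair County", "Autauga County"]}
)

# The duplicated name matches twice, so two events become three rows.
merged = events.merge(
    geo[["area_name", "county_fips"]],
    left_on="county",
    right_on="area_name",
    how="left",
)
print(len(events), len(merged))

# validate="m:1" asks pandas to check that the right-hand keys are unique
# and to raise MergeError if they are not.
try:
    events.merge(
        geo[["area_name", "county_fips"]],
        left_on="county",
        right_on="area_name",
        how="left",
        validate="m:1",
    )
    duplicate_keys = False
except MergeError:
    duplicate_keys = True
print(duplicate_keys)
```

If the check fails, merging on state_fips together with the county name is one way to make the keys unique again.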

If the area_name column contains duplicates, first remove them with DataFrame.drop_duplicates:

dfgeo = dfgeo.drop_duplicates('area_name')

Then use Series.map, which is faster than merge, so it should be preferable:

df_event['state_fips'] = df_event['state'].map(dfgeo.set_index('area_name')['state_fips'])
df_event['county_fips'] = df_event['county'].map(dfgeo.set_index('area_name')['county_fips'])
print (df_event)
  unique_str                        county        state state_fips county_fips
0   Event Id                             _      Alabama         01         NaN
1   Event Id                             _  Connecticut         09         NaN
2   Event Id                Autauga County      Alabama         01         001
3   Event Id              Fairfield County  Connecticut         09         001
4   Event Id  Fairbanks North Star Borough       Alaska         02         090
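Note that dropping duplicates on area_name keeps only the first row for each name, so a county name shared by several states would receive the first state's codes. As a hedged alternative, the sketch below wraps a map-style lookup in a function that keys counties on the pair (state_fips, county name). The function name `add_fips` is hypothetical, and the code again assumes that summary_level '040' marks states and '050' marks counties, as in the sample data.

```python
import pandas as pd


def add_fips(events, geo):
    """Return a copy of events with state_fips and county_fips columns added.

    Assumes geo has summary_level, state_fips, county_fips and area_name
    columns, with '040' marking state rows and '050' marking county rows.
    """
    out = events.copy()
    state_rows = geo[geo["summary_level"] == "040"]
    county_rows = geo[geo["summary_level"] == "050"]

    # State names are unique among state rows, so Series.map works directly.
    state_lookup = state_rows.set_index("area_name")["state_fips"]
    out["state_fips"] = out["state"].map(state_lookup)

    # Counties are keyed on (state_fips, name) so that shared names stay apart.
    county_lookup = dict(
        zip(
            zip(county_rows["state_fips"], county_rows["area_name"]),
            county_rows["county_fips"],
        )
    )
    out["county_fips"] = [
        county_lookup.get(key) for key in zip(out["state_fips"], out["county"])
    ]
    return out


geo = pd.DataFrame(
    [
        ["040", "01", "000", "Alabama"],
        ["050", "01", "115", "St. Clair County"],
        ["040", "17", "000", "Illinois"],
        ["050", "17", "163", "St. Clair County"],
    ],
    columns=["summary_level", "state_fips", "county_fips", "area_name"],
)
events = pd.DataFrame(
    [
        ["e1", "_", "Alabama"],
        ["e2", "St. Clair County", "Illinois"],
        ["e3", "St. Clair County", "Alabama"],
        ["e4", "Fairbanks North Star Borough", "Alaska"],
    ],
    columns=["event_id", "county", "state"],
)

tagged = add_fips(events, geo)
print(tagged)
```

Events whose state or county is not present in the geography table (here the Alaska row, and the row with county '_') are left with missing values rather than raising an error.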
