I am working with two large data frames, dataM and dataD:
I want to join the two data frames by County, State and Year. dataM has to retain all of its columns and pick up only the Deprivation Index Percent column from dataD. I also want to drop the rows whose county does not exist in one frame or the other. For instance, dataM has AK and its counties, but dataD has no AK, so I want to drop all of those rows from dataM. In the same way, if the county and state exist in both, I want to assign the Deprivation Index Percent to every row with that county in that state. I tried everything, but I can't make it work.
I tried this in many forms:
dataM = pd.merge(dataM, dataD, how='right', left_on=['County', 'State'], right_on=['County', 'State'])
After filtering for Baldwin County, which is in both data frames, I got this:
I don't understand why I am getting NaN if the county and state exist in both data frames. Please help me.
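A right join keeps every row of the right-hand frame and fills the left-hand columns with NaN wherever the key values do not match exactly. One way to check whether the keys really match is the `indicator=True` option of `pd.merge`. The sketch below uses made-up toy frames. The trailing space in "Baldwin " and the column name `depr_ind_col` are assumptions chosen for illustration, not the actual cause or schema from the question.

```python
import pandas as pd

# Toy stand-ins for the real frames (values are invented for illustration).
left = pd.DataFrame({
    "County": ["Baldwin ", "Anchorage"],   # note the stray trailing space
    "State": ["GA", "AK"],
    "Births": [500, 4000],
})
right = pd.DataFrame({
    "County": ["Baldwin"],
    "State": ["GA"],
    "depr_ind_col": [21.5],                # hypothetical column name
})

# An outer merge with indicator=True adds a "_merge" column that says whether
# each row came from the left frame only, the right frame only, or both.
check = pd.merge(left, right, how="outer", on=["County", "State"], indicator=True)
unmatched = check[check["_merge"] != "both"]

# Normalise the key columns, then merge again.
for frame in (left, right):
    for col in ("County", "State"):
        frame[col] = frame[col].str.strip()

fixed = pd.merge(left, right, how="inner", on=["County", "State"])
```

If `unmatched` contains rows you expected to match, compare the key values with `repr()` to spot stray whitespace, different capitalisation, or a "County" suffix that is present in one frame only.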
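I think you need an inner join, selecting only the key columns and the deprivation column from dataD (depr_ind_col below stands for whatever that column is actually called):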
dataM = pd.merge(dataM, dataD[['depr_ind_col', 'County', 'State']], how='inner', on=['County', 'State'])
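As a minimal sketch of what the inner join does, here it is on toy data. The `Year`, `Births` and `Other` columns and all of the values are assumptions made up for the example. `depr_ind_col` is the placeholder name from the answer.

```python
import pandas as pd

# Toy stand-ins for the real frames (all values are invented).
dataM = pd.DataFrame({
    "County": ["Baldwin", "Baldwin", "Anchorage"],
    "State": ["GA", "GA", "AK"],
    "Year": [2015, 2016, 2015],
    "Births": [500, 520, 4000],
})
dataD = pd.DataFrame({
    "County": ["Baldwin", "Fulton"],
    "State": ["GA", "GA"],
    "depr_ind_col": [21.5, 14.2],   # placeholder name for the deprivation column
    "Other": [1, 2],                # a column we do not want to carry over
})

# Keep only the keys and the one wanted column from dataD, then inner-join:
# rows whose (County, State) pair is missing from either frame are dropped.
merged = pd.merge(
    dataM,
    dataD[["County", "State", "depr_ind_col"]],
    how="inner",
    on=["County", "State"],
)
```

Both Baldwin rows of dataM receive the same deprivation value, the AK row disappears because dataD has no AK, and Fulton disappears because dataM has no Fulton. If dataD also varies by year, adding "Year" to both the selected columns and the `on` list would match on it as well.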
After so many tries, I ended up concatenating the county and state for dataM and assigning the result to a new column named "County, State". Then I just used a simple merge:
dataM = pd.merge(dataM , dataD, how='right', on=['County, State'])
dataM = dataM[dataM['County, State'] == 'Baldwin County, GA']
dataM
That gave me the results I was looking for. I will split the county and state after this, and then drop the rows with NaN in Births.
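The whole round trip (build the combined key, merge, split it back, drop the rows without Births) might look roughly like this. It is a sketch on invented data and assumes that dataD already has a "County, State" column. `depr_ind_col` is again a placeholder name.

```python
import pandas as pd

# Toy stand-ins for the real frames (all values are invented).
dataM = pd.DataFrame({
    "County": ["Baldwin County", "Anchorage"],
    "State": ["GA", "AK"],
    "Births": [500, 4000],
})
dataD = pd.DataFrame({
    "County, State": ["Baldwin County, GA", "Fulton County, GA"],
    "depr_ind_col": [21.5, 14.2],   # placeholder name
})

# Build the combined key on the dataM side.
dataM["County, State"] = dataM["County"] + ", " + dataM["State"]

# Right join on the combined key; dataD rows with no match get NaN in Births.
merged = pd.merge(
    dataM.drop(columns=["County", "State"]),
    dataD,
    how="right",
    on="County, State",
)

# Split the key back into two columns, splitting on the last ", " only.
merged[["County", "State"]] = merged["County, State"].str.rsplit(", ", n=1, expand=True)

# Drop the rows that found no match in dataM, then drop the helper column.
merged = merged.dropna(subset=["Births"]).drop(columns=["County, State"])
```

A right join followed by dropping the NaN rows leaves the same rows as an inner join on the same key, as long as Births has no missing values in dataM itself. Using `how="inner"` would therefore skip the `dropna` step.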
Thank you for your help though!