A corollary of the question here: create unique identifier in dataframe based on combination of columns
In the following dataframe,
id     Lat       Lon       Year  Area  State
50319  -36.0629  -62.3423  2019  90    Iowa
18873  -36.0629  -62.3423  2017  90    Iowa
18876  -36.0754  -62.327   2017  124   Illinois
18878  -36.0688  -62.3353  2017  138   Kansas
I want to create a new column that assigns a unique identifier based on whether the columns Lat, Lon and Area have the same values. E.g., in this case rows 1 and 2 have the same values in those columns and will be given the same unique identifier 0_Iowa, where Iowa comes from the State column. However, if a row has no duplicate, I just want to use the state name. The end result should look like this:
id     Lat       Lon       Year  Area  State     unique_id
50319  -36.0629  -62.3423  2019  90    Iowa      0_Iowa
18873  -36.0629  -62.3423  2017  90    Iowa      0_Iowa
18876  -36.0754  -62.327   2017  124   Illinois  Illinois
18878  -36.0688  -62.3353  2017  138   Kansas    Kansas
You can use np.where:
import numpy as np

cols = ['Lat', 'Lon', 'Area']  # columns that define a duplicate
df['unique_id'] = np.where(df.duplicated(cols, keep=False),
                           df.groupby(cols, sort=False).ngroup().astype('str') + '_' + df['State'],
                           df['State'])
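To see why this works, it can help to look at the two building blocks on their own. The following is a minimal sketch, assuming the sample data from the question rebuilt by hand and grouping on Lat, Lon and Area as the question describes; the names `cols`, `mask` and `group_no` are only illustrative.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand for illustration
df = pd.DataFrame({
    'id': [50319, 18873, 18876, 18878],
    'Lat': [-36.0629, -36.0629, -36.0754, -36.0688],
    'Lon': [-62.3423, -62.3423, -62.3270, -62.3353],
    'Year': [2019, 2017, 2017, 2017],
    'Area': [90, 90, 124, 138],
    'State': ['Iowa', 'Iowa', 'Illinois', 'Kansas'],
})

cols = ['Lat', 'Lon', 'Area']  # illustrative name for the key columns

# True for every row whose key combination occurs more than once
mask = df.duplicated(cols, keep=False)

# Group number per row; with sort=False groups are numbered
# in order of first appearance
group_no = df.groupby(cols, sort=False).ngroup()

print(mask.tolist())      # [True, True, False, False]
print(group_no.tolist())  # [0, 0, 1, 2]
```

The mask decides which rows get a numbered label, and the group number supplies the prefix for those rows.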
Or a similar idea with pd.Series.where:
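cols = ['Lat', 'Lon', 'Area']  # columns that define a duplicate
df['unique_id'] = (df.groupby(cols, sort=False)
                     .ngroup().astype('str')
                     .add('_' + df['State'])
                     # keep the numbered label for duplicated rows, else fall back to State
                     .where(df.duplicated(cols, keep=False),
                            df['State'])
                  )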
Output:
      id      Lat      Lon  Year  Area     State unique_id
0  50319 -36.0629 -62.3423  2019    90      Iowa    0_Iowa
1  18873 -36.0629 -62.3423  2017    90      Iowa    0_Iowa
2  18876 -36.0754 -62.3270  2017   124  Illinois  Illinois
3  18878 -36.0688 -62.3353  2017   138    Kansas    Kansas
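One thing worth knowing is that ngroup numbers every group, not only the duplicated ones, so a later duplicate group will not necessarily get the next consecutive number. The sketch below checks this and also confirms that the np.where and Series.where variants agree. It assumes the question's sample data plus two hypothetical Texas rows that are not in the original; the names `is_dup`, `labelled`, `via_numpy` and `via_pandas` are illustrative.

```python
import numpy as np
import pandas as pd

# Question's sample data plus two hypothetical Texas rows (not in the original)
df = pd.DataFrame({
    'id': [50319, 18873, 18876, 18878, 90001, 90002],
    'Lat': [-36.0629, -36.0629, -36.0754, -36.0688, -35.5, -35.5],
    'Lon': [-62.3423, -62.3423, -62.3270, -62.3353, -61.0, -61.0],
    'Year': [2019, 2017, 2017, 2017, 2018, 2019],
    'Area': [90, 90, 124, 138, 77, 77],
    'State': ['Iowa', 'Iowa', 'Illinois', 'Kansas', 'Texas', 'Texas'],
})

cols = ['Lat', 'Lon', 'Area']
is_dup = df.duplicated(cols, keep=False)
labelled = df.groupby(cols, sort=False).ngroup().astype(str) + '_' + df['State']

# np.where variant: numbered label for duplicated rows, State otherwise
via_numpy = pd.Series(np.where(is_dup, labelled, df['State']), index=df.index)

# Series.where variant: keep the label where is_dup is True, else use State
via_pandas = labelled.where(is_dup, df['State'])

df['unique_id'] = via_pandas
print(df['unique_id'].tolist())
# ['0_Iowa', '0_Iowa', 'Illinois', 'Kansas', '3_Texas', '3_Texas']
```

The Texas rows get the prefix 3 because Illinois and Kansas already took group numbers 1 and 2. If consecutive numbering among duplicates matters, the groups would need to be renumbered after filtering.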