Assign unique identifier for dataframe rows based on dataframe with preassigned unique identifier
I have a dataframe with a unique identifier assigned based on three columns, i.e. [col2, col3, col4].
Dataframe1:
col1 col2 col3 col4 col5 unique_id
1 abc bcv zxc www.com 8
2 bcd qwe rty www.@com 12
3 klp oiu ytr www.io 15
4 zxc qwe rty www.com 6
After data preprocessing, I will import Dataframe_2, which has the same columns as shown above but no unique_id. Rows of Dataframe_2 must be assigned a unique identifier based on col2, col3 and col4, by looking them up in Dataframe1.
If Dataframe_2 has a new row that is not present in Dataframe1, it should be assigned a new identifier.
Dataframe_2:
col1 col2 col3 col4 col5
1 bcd qwe rty www.@com
2 zxc qwe rty www.com
3 abc bcv zxc www.com
4 kph hir mat www.com
Expected Dataframe_2:
col1 col2 col3 col4 col5 unique_id
1 bcd qwe rty www.@com 12
2 zxc qwe rty www.com 6
3 abc bcv zxc www.com 8
4 kph hir mat www.com 35
Since row 4 is not present in Dataframe1, it is assigned a new unique identifier.
# assign the old unique_id by joining on the key columns
df2n = df2.join(df1.set_index(['col2', 'col3', 'col4', 'col5'])[['unique_id']],
                on=['col2', 'col3', 'col4', 'col5'], how='left')
# assign new unique_id values starting at max df1.unique_id + 1
id_max = df1.unique_id.max() + 1
cond = df2n['unique_id'].isnull()
null_num = cond.sum()
df2n.loc[cond, 'unique_id'] = range(id_max, id_max + null_num)
df2n['unique_id'] = df2n['unique_id'].astype(int)
print(df2n)
col1 col2 col3 col4 col5 unique_id
0 1 bcd qwe rty www.@com 12
1 2 zxc qwe rty www.com 6
2 3 abc bcv zxc www.com 8
3 4 kph hir mat www.com 16
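The join approach above can be tried end to end with the sample data from the question. The following is only a sketch under a few assumptions: the helper name `assign_ids_by_join` is hypothetical, the lookup uses only col2, col3 and col4 as the question describes (the snippet above also joins on col5), and duplicate key rows in the reference frame are dropped so the join cannot multiply rows.

```python
import pandas as pd

# Sample frames copied from the question
df1 = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["abc", "bcd", "klp", "zxc"],
    "col3": ["bcv", "qwe", "oiu", "qwe"],
    "col4": ["zxc", "rty", "ytr", "rty"],
    "col5": ["www.com", "www.@com", "www.io", "www.com"],
    "unique_id": [8, 12, 15, 6],
})
df2 = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["bcd", "zxc", "abc", "kph"],
    "col3": ["qwe", "qwe", "bcv", "hir"],
    "col4": ["rty", "rty", "zxc", "mat"],
    "col5": ["www.@com", "www.com", "www.com", "www.com"],
})

KEYS = ["col2", "col3", "col4"]  # assumption: only these three columns define identity


def assign_ids_by_join(new, ref, keys=KEYS):
    """Hypothetical helper: copy known ids from ref, number unseen rows after the max."""
    # One lookup row per key combination, indexed by the keys
    lookup = ref.drop_duplicates(subset=keys).set_index(keys)[["unique_id"]]
    out = new.join(lookup, on=keys, how="left")
    # Rows without a match have NaN in unique_id
    missing = out["unique_id"].isna()
    start = int(ref["unique_id"].max()) + 1
    out.loc[missing, "unique_id"] = list(range(start, start + int(missing.sum())))
    out["unique_id"] = out["unique_id"].astype(int)
    return out


result = assign_ids_by_join(df2, df1)
print(result)
```

With the sample data the first three rows keep ids 12, 6 and 8, and the unseen row receives 16, the first value after the current maximum of 15.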
First add the unique_id column with DataFrame.merge using a left join. The on parameter is omitted, so the merge uses the columns common to both frames, which are the columns ['col2','col3','col4'] selected in the subset. Rows without a match get missing values, so test for them with Series.isna, create an array of new ids starting after the maximal value with np.arange, and assign it with DataFrame.loc.
import numpy as np

# left join on the shared columns col2, col3, col4 (on is omitted)
df = Dataframe_2.merge(Dataframe_1[['col2','col3','col4', 'unique_id']],
                       how='left')
# rows that found no match in Dataframe_1
mask = df['unique_id'].isna()
maximal = Dataframe_1['unique_id'].max() + 1
df.loc[mask, 'unique_id'] = np.arange(maximal, maximal + mask.sum())
df['unique_id'] = df['unique_id'].astype(int)
print(df)
col1 col2 col3 col4 col5 unique_id
0 1 bcd qwe rty www.@com 12
1 2 zxc qwe rty www.com 6
2 3 abc bcv zxc www.com 8
3 4 kph hir mat www.com 16
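A variation on the merge above, offered as a sketch rather than as part of the original answer: passing `on` explicitly, together with the `validate` and `indicator` options of DataFrame.merge, makes the key columns visible and raises an error if Dataframe_1 contains the same key combination more than once. The variable names `keys`, `is_new` and `first_new` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Sample frames copied from the question
Dataframe_1 = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["abc", "bcd", "klp", "zxc"],
    "col3": ["bcv", "qwe", "oiu", "qwe"],
    "col4": ["zxc", "rty", "ytr", "rty"],
    "col5": ["www.com", "www.@com", "www.io", "www.com"],
    "unique_id": [8, 12, 15, 6],
})
Dataframe_2 = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["bcd", "zxc", "abc", "kph"],
    "col3": ["qwe", "qwe", "bcv", "hir"],
    "col4": ["rty", "rty", "zxc", "mat"],
    "col5": ["www.@com", "www.com", "www.com", "www.com"],
})

keys = ["col2", "col3", "col4"]

df = Dataframe_2.merge(
    Dataframe_1[keys + ["unique_id"]],
    on=keys,
    how="left",
    validate="many_to_one",  # fail if a key combination repeats in Dataframe_1
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows that exist only in Dataframe_2 need a fresh id
is_new = df["_merge"] == "left_only"
first_new = int(Dataframe_1["unique_id"].max()) + 1
df.loc[is_new, "unique_id"] = np.arange(first_new, first_new + is_new.sum())
df["unique_id"] = df["unique_id"].astype(int)
df = df.drop(columns="_merge")
print(df)
```

The `_merge` column does the same job as the `isna` mask, but it stays correct even if unique_id itself could legitimately be missing in Dataframe_1.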
import math
import random
import pandas as pd

# right join keeps every row of df2; unmatched rows get NaN in unique_id
df3 = pd.merge(df1, df2, on=['col2','col3','col4'], how='right')

def return_unique_num(used_ids):
    # draw random integers until one is not already in use
    upper = len(used_ids) + 1
    unique_num = random.randint(1, upper)
    while unique_num in used_ids:
        unique_num = random.randint(1, upper)
    return unique_num

used_ids = set(df1['unique_id'])
id_col = df3.columns.get_loc('unique_id')
for i, e in enumerate(df3['unique_id']):
    if math.isnan(e):
        # replace NaN with an unused integer and remember it,
        # so two new rows cannot receive the same id
        new_id = return_unique_num(used_ids)
        df3.iloc[i, id_col] = new_id
        used_ids.add(new_id)

df3['unique_id'] = df3['unique_id'].astype(int)
df2['unique_id'] = df3['unique_id']
This assigns unique IDs to df2 based on the unique_id values of df1. IDs for new rows are drawn at random, so the value shown for row 4 is only illustrative and will vary between runs.
Output
col1 col2 col3 col4 col5 unique_id
1 bcd qwe rty www.@com 12
2 zxc qwe rty www.com 6
3 abc bcv zxc www.com 8
4 kph hir mat www.com 35
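When several rows are new, randomly drawn ids must also avoid each other, not just the ids already present in df1. The sketch below isolates that step; the function name `draw_unused_ids` and the optional `rng` parameter are hypothetical and not part of the answer above. Candidates come from 1 to len(used) + count, a range that always contains at least `count` free values, so the loop terminates.

```python
import random


def draw_unused_ids(used_ids, count, rng=None):
    """Hypothetical helper: draw `count` distinct random ids not in `used_ids`."""
    rng = rng or random.Random()
    used = set(used_ids)
    # At most len(used) values in 1..upper are taken, so `count` stay free
    upper = len(used) + count
    new_ids = []
    while len(new_ids) < count:
        candidate = rng.randint(1, upper)
        if candidate not in used:
            used.add(candidate)  # remember it so it is not drawn again
            new_ids.append(candidate)
    return new_ids


existing = [8, 12, 15, 6]  # unique_id values of Dataframe1 in the question
new_ids = draw_unused_ids(existing, 3, rng=random.Random(0))
print(new_ids)
```

If reproducible ids matter more than randomness, continuing from the current maximum, as the other answers do, avoids the retry loop entirely.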