
Assign unique identifier for dataframe rows based on dataframe with preassigned unique identifier

I have a dataframe with a unique identifier assigned based on three columns, i.e. [col2, col3, col4].

Dataframe1:

col1      col2     col3     col4      col5         unique_id
1         abc       bcv      zxc      www.com        8
2         bcd       qwe      rty      www.@com       12
3         klp       oiu      ytr      www.io         15
4         zxc       qwe      rty      www.com        6

After data preprocessing, I will import Dataframe_2, which has the same columns as shown above but no unique_id. The rows of Dataframe_2 must be assigned a unique identifier based on col2, col3, col4 by referring to Dataframe1.

If Dataframe_2 has a new row that is not present in Dataframe1, a new identifier should be assigned to it.

Dataframe_2:

col1      col2     col3     col4      col5         
1         bcd       qwe      rty      www.@com              
2         zxc       qwe      rty      www.com
3         abc       bcv      zxc      www.com 
4         kph       hir      mat      www.com            

Expected Dataframe_2:

col1      col2     col3     col4      col5         unique_id        
1         bcd       qwe      rty      www.@com        12     
2         zxc       qwe      rty      www.com         6
3         abc       bcv      zxc      www.com         8 
4         kph       hir      mat      www.com         35

Since row 4 is not present in Dataframe1, a new unique identifier is assigned to it.
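To try the approaches that follow, the two tables from the question can be typed in as small frames. This is only a sketch: the names `df1` and `df2` are chosen to match the answer snippets (one answer calls them `Dataframe_1` and `Dataframe_2` instead), and the dtypes are whatever pandas infers.

```python
import pandas as pd

# Sample frames copied from the tables in the question.
df1 = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["abc", "bcd", "klp", "zxc"],
    "col3": ["bcv", "qwe", "oiu", "qwe"],
    "col4": ["zxc", "rty", "ytr", "rty"],
    "col5": ["www.com", "www.@com", "www.io", "www.com"],
    "unique_id": [8, 12, 15, 6],
})

# Same columns as df1, but without unique_id.
df2 = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["bcd", "zxc", "abc", "kph"],
    "col3": ["qwe", "qwe", "bcv", "hir"],
    "col4": ["rty", "rty", "zxc", "mat"],
    "col5": ["www.@com", "www.com", "www.com", "www.com"],
})
```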

# assign the old unique_id by matching on the key columns from the question
keys = ['col2', 'col3', 'col4']
df2n = df2.join(df1.set_index(keys)[['unique_id']], on=keys, how='left')

# assign new unique_id values starting at max df1.unique_id + 1
id_max = df1.unique_id.max() + 1
cond = df2n['unique_id'].isnull()
null_num = cond.sum()

df2n.loc[cond, 'unique_id'] = range(id_max, id_max + null_num)
# the left join leaves NaN for new rows, so the column is float until cast back
df2n['unique_id'] = df2n['unique_id'].astype(int)

print(df2n)

      col1 col2 col3 col4      col5  unique_id
    0     1  bcd  qwe  rty  www.@com         12
    1     2  zxc  qwe  rty   www.com          6
    2     3  abc  bcv  zxc   www.com          8
    3     4  kph  hir  mat   www.com         16
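The final astype(int) in the snippet above is needed because a left join leaves NaN in unique_id for rows with no match, and a column containing NaN is stored as float. The sketch below shows that intermediate state on two cut-down, hypothetical frames, together with pandas' nullable "Int64" dtype as one way to keep the matched IDs as integers before the gaps are filled.

```python
import pandas as pd

keys = ["col2", "col3", "col4"]

# Tiny illustrative frames (hypothetical, cut down from the question's data).
df1 = pd.DataFrame({"col2": ["abc", "bcd"], "col3": ["bcv", "qwe"],
                    "col4": ["zxc", "rty"], "unique_id": [8, 12]})
df2 = pd.DataFrame({"col2": ["bcd", "kph"], "col3": ["qwe", "hir"],
                    "col4": ["rty", "mat"]})

joined = df2.join(df1.set_index(keys)[["unique_id"]], on=keys, how="left")

# The unmatched row holds NaN, so pandas stores the column as float64.
print(joined["unique_id"].dtype)

# The nullable integer dtype keeps the gap as <NA> and the matched ID as an int.
nullable = joined["unique_id"].astype("Int64")
print(nullable.tolist())
```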

First add the unique_id column with DataFrame.merge using a left join. The on parameter is omitted, so the merge uses the columns shared by both frames, here ['col2','col3','col4'], which are the ones kept in the subset taken from Dataframe_1. Rows without a match get missing values, so Series.isna is used to test for them, np.arange creates a new array of IDs starting after the maximal existing value, and DataFrame.loc assigns them.

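import numpy as np

# left join; with `on` omitted, the shared columns col2, col3, col4 are the keys
df = Dataframe_2.merge(Dataframe_1[['col2','col3','col4', 'unique_id']],
                       how='left')

# rows of Dataframe_2 with no match in Dataframe_1
mask = df['unique_id'].isna()
maximal = Dataframe_1['unique_id'].max() + 1

# number the unmatched rows consecutively after the largest existing ID
df.loc[mask, 'unique_id'] = np.arange(maximal, maximal + mask.sum())

df['unique_id'] = df['unique_id'].astype(int)
print(df)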
   col1 col2 col3 col4      col5  unique_id
0     1  bcd  qwe  rty  www.@com         12
1     2  zxc  qwe  rty   www.com          6
2     3  abc  bcv  zxc   www.com          8
3     4  kph  hir  mat   www.com         16
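The merge-then-fill steps above can be wrapped into a small reusable function. This is a minimal sketch, not part of the original answer: the name `assign_ids` and its parameters are hypothetical, and `validate="many_to_one"` is added on the assumption that each key combination should appear only once in the reference frame.

```python
import numpy as np
import pandas as pd

def assign_ids(reference, new, keys, id_col="unique_id"):
    """Hypothetical helper: copy IDs from `reference` onto `new` by `keys`,
    then number unmatched rows consecutively after the largest existing ID."""
    lookup = reference[keys + [id_col]]
    # validate="many_to_one" raises MergeError if a key combination occurs
    # more than once in the reference, which would otherwise duplicate rows.
    out = new.merge(lookup, on=keys, how="left", validate="many_to_one")
    mask = out[id_col].isna()
    start = reference[id_col].max() + 1
    out.loc[mask, id_col] = np.arange(start, start + mask.sum())
    out[id_col] = out[id_col].astype(int)
    return out

# Key columns and IDs from the question's tables.
df1 = pd.DataFrame({"col2": ["abc", "bcd", "klp", "zxc"],
                    "col3": ["bcv", "qwe", "oiu", "qwe"],
                    "col4": ["zxc", "rty", "ytr", "rty"],
                    "unique_id": [8, 12, 15, 6]})
df2 = pd.DataFrame({"col2": ["bcd", "zxc", "abc", "kph"],
                    "col3": ["qwe", "qwe", "bcv", "hir"],
                    "col4": ["rty", "rty", "zxc", "mat"]})

result = assign_ids(df1, df2, ["col2", "col3", "col4"])
print(result["unique_id"].tolist())  # [12, 6, 8, 16]
```

Note that two new rows sharing the same key combination would still receive two different IDs here, exactly as in the snippet above; whether that is wanted depends on the data.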
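import math
import random
import pandas as pd
import numpy as np

# right join keeps every row of df2 in df2's order; unmatched rows get NaN
df3 = pd.merge(df1, df2, on=['col2','col3','col4'], how='right')

def return_unique_num(used_ids):
  # draw random integers until one is found that is not already in use;
  # the range 1..len(used_ids)+1 always contains at least one free value
  upper = len(used_ids) + 1
  unique_num = random.randint(1, upper)
  while unique_num in used_ids:
    unique_num = random.randint(1, upper)
  return unique_num

used_ids = set(df1['unique_id'])
id_col = df3.columns.get_loc('unique_id')  # position of unique_id in df3
for i, e in enumerate(df3['unique_id']):
  if math.isnan(e):
    new_id = return_unique_num(used_ids)
    df3.iloc[i, id_col] = new_id  # replace NaN with a new unique integer
    used_ids.add(new_id)  # so the next new row cannot receive the same ID

df3['unique_id'] = df3['unique_id'].astype(int)

df2['unique_id'] = df3['unique_id']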

This assigns unique IDs to df2 based on the unique_id values of df1.

Output (the ID of the new row is chosen at random, so it will differ from the 35 shown):

col1      col2     col3     col4      col5         unique_id        
1         bcd       qwe      rty      www.@com        12     
2         zxc       qwe      rty      www.com         6
3         abc       bcv      zxc      www.com         8 
4         kph       hir      mat      www.com         35
