Implementing CDC but getting a ValueError in Python Pandas

Question

I am trying to perform a CDC (change data capture) operation in Python. I want to union the unchanged data from the master file (base table) with the new file (delta file).

Below is the function I have written:

def processInputdata():
    df1 = pd.read_csv('master.csv')
    df2 = pd.read_csv('delta.csv')
    df=pd.merge(df1,df2,on=['cust_id','cust_id'],how="outer",indicator=True)
    dfo=df[df['_merge']=='left_only']
    dfT =pd.merge(dfo,df2,on=['cust_id','cust_id'],how="right",indicator=True)

This is not working. Below is the error message:

ValueError: Cannot use name of an existing column for indicator column

I am not sure whether there is a simpler or better approach to performing CDC.

Sample data:

Master file:

   cust_id cust_name  cust_income cust_phone
0      111     a            78000       sony
1      222     b             8000        jio
2      333     c           108000     iphone
3      444     d           200000    iphoneX
4      555     e            20000    samsung

Delta file:

   cust_id cust_name  cust_income cust_phone
0      222     b            20000        jio
1      333     c           120000    iphoneX
2      666     f            76000    oneplus

Expected output:

   cust_id cust_name  cust_income cust_phone
0      111     a            78000       sony
1      222     b            20000        jio
2      333     c           120000     iphoneX
3      444     d           200000    iphoneX
4      555     e            20000    samsung
5      666     f            76000    oneplus

Answer 1

Concatenate the two frames and use drop_duplicates with keep='last', so that for each cust_id the row from the delta file wins (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):

df = (pd.concat([master, delta])
        .drop_duplicates(subset='cust_id', keep='last')
        .sort_values('cust_name')
        .reset_index(drop=True))

   cust_id cust_name  cust_income cust_phone
0      111         a        78000       sony
1      222         b        20000        jio
2      333         c       120000    iphoneX
3      444         d       200000    iphoneX
4      555         e        20000    samsung
5      666         f        76000    oneplus
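
To try this idea against the sample data from the question, here is a minimal self-contained sketch. It assumes cust_id alone identifies a customer and that a delta row should always replace the matching master row. The helper name upsert is hypothetical and only used for illustration.

```python
import pandas as pd

# Sample data copied from the question.
master = pd.DataFrame({
    "cust_id": [111, 222, 333, 444, 555],
    "cust_name": ["a", "b", "c", "d", "e"],
    "cust_income": [78000, 8000, 108000, 200000, 20000],
    "cust_phone": ["sony", "jio", "iphone", "iphoneX", "samsung"],
})
delta = pd.DataFrame({
    "cust_id": [222, 333, 666],
    "cust_name": ["b", "c", "f"],
    "cust_income": [20000, 120000, 76000],
    "cust_phone": ["jio", "iphoneX", "oneplus"],
})


def upsert(master, delta, key="cust_id"):
    """Hypothetical helper: delta rows override master rows that share `key`."""
    # Stack master first and delta last, so keep="last" prefers the delta row.
    stacked = pd.concat([master, delta], ignore_index=True)
    return (stacked.drop_duplicates(subset=key, keep="last")
                   .sort_values(key)
                   .reset_index(drop=True))


merged = upsert(master, delta)
print(merged)
```

Deduplicating on the key only (rather than on the key plus other columns) matters here: customer 333 changes cust_phone from iphone to iphoneX, so a subset that includes cust_phone would treat the old and new rows as different and keep both.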

Answer 2

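Use DataFrame.merge + DataFrame.drop_duplicates, deduplicating on the key column cust_id so that the delta row replaces the master row. This relies on the delta row for a given cust_id appearing after the master row in the merged result, which depends on how pandas orders an outer merge:

new_df = (df_master.merge(df_delta, how='outer', sort=False)
                   .drop_duplicates('cust_id', keep='last')
                   .sort_values('cust_id')
                   .reset_index(drop=True))
print(new_df)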

   cust_id cust_name  cust_income cust_phone
0      111         a        78000       sony
1      222         b        20000        jio
2      333         c       120000    iphoneX
3      444         d       200000    iphoneX
4      555         e        20000    samsung
5      666         f        76000    oneplus

or pd.concat, which makes the row order (master first, delta last) explicit:

new_df = (pd.concat([df_master, df_delta], sort=False)
            .drop_duplicates('cust_id', keep='last')
            .sort_values('cust_id')
            .reset_index(drop=True))
print(new_df)

   cust_id cust_name  cust_income cust_phone
0      111         a        78000       sony
1      222         b        20000        jio
2      333         c       120000    iphoneX
3      444         d       200000    iphoneX
4      555         e        20000    samsung
5      666         f        76000    oneplus
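
As for the ValueError itself: the first merge in the question is called with indicator=True, which adds a column named _merge. That column is still present in dfo, so the second merge with indicator=True tries to create a second _merge column and fails. A minimal sketch of the indicator-based approach that avoids the clash is shown below. It assumes cust_id is the only key, and the function name apply_delta is hypothetical.

```python
import pandas as pd

# Sample data copied from the question.
master = pd.DataFrame({
    "cust_id": [111, 222, 333, 444, 555],
    "cust_name": ["a", "b", "c", "d", "e"],
    "cust_income": [78000, 8000, 108000, 200000, 20000],
    "cust_phone": ["sony", "jio", "iphone", "iphoneX", "samsung"],
})
delta = pd.DataFrame({
    "cust_id": [222, 333, 666],
    "cust_name": ["b", "c", "f"],
    "cust_income": [20000, 120000, 76000],
    "cust_phone": ["jio", "iphoneX", "oneplus"],
})


def apply_delta(master, delta, key="cust_id"):
    """Hypothetical helper: keep untouched master rows, then add all delta rows."""
    # Merge against the delta keys only, so no master column gets duplicated.
    delta_keys = delta[[key]].drop_duplicates()
    tagged = master.merge(delta_keys, on=key, how="left", indicator=True)
    # "left_only" marks master rows whose key does not occur in the delta.
    # Dropping _merge afterwards avoids the indicator-column clash in later merges.
    unchanged = tagged[tagged["_merge"] == "left_only"].drop(columns="_merge")
    return (pd.concat([unchanged, delta], ignore_index=True)
              .sort_values(key)
              .reset_index(drop=True))


result = apply_delta(master, delta)
print(result)
```

Another way around the clash is to pass a string to indicator (for example indicator='source', an arbitrary name), which makes pandas use that column name instead of _merge.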

Answer 1: score 5, accepted, posted 2019-11-06 13:50:19

Answer 2: score 2, posted 2019-11-06 13:52:13
