
Combine two datasets and create a new column based on a specific condition in Python

I have two datasets: one from a MySQL DB (source) and another from a Snowflake DB (target). I picked only one column from each DB for row-level validation. Below is sample data.

Source

emp name
Name1
Name2
Name3
Name4
Name5
Name6
Name7
Name8
Name9
Name10

Target

emp name
Name1
NAME2
Name3
Name4
Name5
Name6
Name7

Name9
Name10
Name11

Expected output:

src_emp_name  tgt_emp_name  Record Validation
Name1         Name1         Match
Name2         NAME2         Mismatch
Name3         Name3         Match
Name4         Name4         Match
Name5         Name5         Match
Name6         Name6         Match
Name7         Name7         Match
Name8         null          Extra
Name9         Name9         Match
Name10        Name10        Match
null          Name11        Missing


Count matrix
Match data      8
Mismatch data   1
Missing data    1
Extra           1

I tried to combine the two datasets (src/tgt) with merge/concat and used np.where to create a new column based on conditions, but I'm not getting the expected output. Please suggest a better way to achieve this.
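For reference, a minimal sketch of the kind of np.where attempt I mean (hypothetical frames, aligned by position); a single np.where call only distinguishes two outcomes, so on its own it doesn't cover the Missing/Extra cases:

import numpy as np
import pandas as pd

# hypothetical frames standing in for the MySQL/Snowflake extracts
src_df = pd.DataFrame({'emp_name': ['Name1', 'Name2']})
tgt_df = pd.DataFrame({'emp_name': ['Name1', 'NAME2']})

combined = pd.concat([src_df.add_prefix('src_'), tgt_df.add_prefix('tgt_')], axis=1)
# np.where picks between exactly two values, so nulls also end up as 'Mismatch'
combined['Record Validation'] = np.where(
    combined['src_emp_name'] == combined['tgt_emp_name'], 'Match', 'Mismatch')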

If your records are ordered, assigning an index column and joining based on it will be appropriate:

import pandas as pd
import numpy as np

df_source = pd.DataFrame(['Name1',
                        'Name2',
                        'Name3',
                        'Name4',
                        'Name5',
                        'Name6',
                        'Name7',
                        'Name8',
                        'Name9',
                        'Name10',
                        ], columns =['emp_name'])

df_target = pd.DataFrame(['Name1',
                        'NAME2',
                        'Name3',
                        'Name4',
                        'Name5',
                        'Name6',
                        'Name7',
                        None,
                        'Name9',
                        'Name10',
                        'Name11',
                        ], columns =['emp_name'])


# outer merge on the row index keeps rows that exist only on one side
df = pd.merge(df_source, df_target, how='outer', left_index=True, right_index=True, suffixes=('_source', '_target'))

# np.select checks the conditions in order and uses the first one that matches;
# a row matching none of them would get the default value 0
conditions = [
    df['emp_name_source'].isnull(),
    df['emp_name_target'].isnull(),
    df['emp_name_target'] != df['emp_name_source'],
    df['emp_name_target'] == df['emp_name_source'],
]
choices = ['Missing', 'Extra', 'Mismatch', 'Match']
df['record_validation'] = np.select(conditions, choices)

res = df.groupby(['record_validation'])['record_validation'].count()
display(res)  # display() works in Jupyter/IPython; use print(res) elsewhere

The last line displays the result as below:

record_validation
Extra       1
Match       8
Mismatch    1
Missing     1
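
To also get the row-level table from the expected output, the merged df can be tidied up afterwards; a minimal sketch, assuming the code above has already run:

df_out = df.rename(columns={'emp_name_source': 'src_emp_name',
                            'emp_name_target': 'tgt_emp_name'}).fillna('null')
print(df_out)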

I hope it helps, thank you :)

I have tried to combine the two datasets (src/tgt) with merge/concat and used np.where to create a new column based on conditions, but I'm not getting the expected output. Please suggest a better way to achieve this.

The code below uses concat as you did, but as a better way to achieve what you are after I suggest using pandas' own .apply() method instead of numpy. This requires defining a function which takes a pandas DataFrame row and returns the new column value.
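A minimal sketch of that pattern, with made-up names, before the full solution further below:

import pandas as pd

def validate(row):
    # return the value of the new column for this single row
    if row['src_emp_name'] == row['tgt_emp_name']:
        return 'Match'
    return 'Mismatch'

merged_df = pd.DataFrame({'src_emp_name': ['Name1', 'Name2'],
                          'tgt_emp_name': ['Name1', 'NAME2']})
merged_df['Record Validation'] = merged_df.apply(validate, axis=1)  # axis=1: one row at a time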

Why another answer to the question?

The other answer uses another method for creating the 'Record Validation' column in the merged pandas DataFrame: numpy.select(), which builds a list of values for this column. That approach requires importing the numpy module and requires great care with the code for the list of conditions and the list of choices, because such code becomes very hard to debug if a value 0 appears in the resulting 'Record Validation' column instead of one of the choices, or if the result is otherwise not as expected. Another disadvantage of numpy.select() is that, for very large datasets, all conditions are evaluated for all rows (while creating the conditions list), which is not the case if a function is used with pandas .apply().
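That said, numpy.select() does accept a default argument, so the silent 0 can at least be made explicit; a minimal sketch with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
conditions = [df['a'] == 1, df['a'] == 2]
choices = ['one', 'two']
# rows matching no condition get 'Unknown' instead of the harder-to-spot 0
df['label'] = np.select(conditions, choices, default='Unknown')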

Another difference is that the code in the other answer doesn't handle the case of an empty string in a data row, as given in the question by the target data containing an empty line. It also doesn't replace pandas NaN values and empty string values with 'null', as suggested by the expected output for the merged DataFrame with the 'Record Validation' column.
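If only that replacement is needed, a minimal pandas-only sketch (independent of the full solution below, with made-up column names):

import pandas as pd

merged_df = pd.DataFrame({'src_emp_name': ['Name8', None],
                          'tgt_emp_name': ['', 'Name11']})
# turn empty strings and NaN/None into the literal string 'null'
merged_df = merged_df.replace('', 'null').fillna('null')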

Instead of a list of conditions and a list of value choices, the code below uses a function which is applied to each row of the DataFrame to obtain the value for the 'Record Validation' column. In my eyes this gives more room for flexibility when some other evaluation has to be done in addition to creating the new DataFrame column. The code below demonstrates this flexibility by building a dictionary with the record validation counts, and it provides multi-line strings and comment lines with additional information.

The function record_validation_and_count:

  • merged_df, count_matrix_df = record_validation_and_count( src_df, tgt_df )

provided in the code accepts both data types as input: multiline strings and pandas DataFrames. It also collects the record validation results in a global dictionary count_matrix_dct for the output of the count matrix, because the matrix obtained by the pandas grouping and counting code can't list counts for values that were not found (in other words, there will be no entries with the value 0 in the pandas count matrix DataFrame).
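If only the missing zero counts are the concern, a small pandas-only alternative to the global dictionary is to reindex the grouped counts, assuming the expected labels are known up front; a minimal sketch with made-up data:

import pandas as pd

validation = pd.Series(['Match', 'Match', 'Mismatch'], name='Record Validation')
expected_labels = ['Match', 'Mismatch', 'Missing', 'Extra']
# labels that never occur now appear with a count of 0 instead of being absent
counts = validation.value_counts().reindex(expected_labels, fill_value=0)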

import pandas as pd
""" I have two dataset one is from mysql db (source) another is from 
snowflake db (target). i picked only one column for row level validation
from both db. below is sample data. Source"""
# Assuming that the datasets come as multiline strings: 
src_df = """\
emp name
Name1
Name2
Name3
Name4
Name5
Name6
Name7
Name8
Name9
Name10"""
tgt_df = """\
emp name
Name1
NAME2
Name3
Name4
Name5
Name6
Name7

Name9
Name10
Name11"""

"""expected output is

src_emp_name tgt emp name Record Validation
Name1         Name1    Match
Name2         NAME2    Mismatch
Name3         Name3    Match
Name4         Name4    Match
Name5         Name5    Match
Name6         Name6    Match
Name7         Name7    Match
Name8         null     Extra
Name9         Name9    Match
Name10        Name10   Match
null          Name11   Missing

count matrix
Match data      8
mismatch data   1
missing data    1
extra           1"""

"""i tried to combine two dataset(src/tgt)with merge/concat function"""
# USE: merged_df = pd.concat([src_df, tgt_df], axis=1)

"""and used np.where for creating new column based on conditions but not 
getting the expected output. Please suggest better way to achieve this."""
# A better approach will be using a function which is taking a row from
# the dataframe as parameter for creating new column rows. 
# This allows to include in this function modifications of the source
# dataframe replacing pandas NaN and empty '' strings with null along
# other code for evaluation (here: a dictionary for the count matrix): 
count_matrix_dct = {
    'match data'    : 0,
    'mismatch data' : 0,
    'missing data'  : 0,
    'extra'         : 0
    }
def record_validation(row, case_sensitive=True):
    #       can be changed to False HERE  --^  
    global count_matrix_dct
    #           v-- first column of the merged DataFrame
    if row.iloc[0] == '' or pd.isnull(row.iloc[0]): # but [:,0] if in df
        count_matrix_dct['missing data'] += 1
        row.iloc[0] = 'null'
        return 'Missing'
    #             v-- second column of the merged DataFrame
    elif row.iloc[1] == '' or pd.isnull(row.iloc[1]):
        count_matrix_dct['extra'] += 1
        row.iloc[1] = 'null'
        return 'Extra'
    if case_sensitive: 
        # is used to remove leading/trailing spaces  --v
        ismatch = row.iloc[0].strip() == row.iloc[1].strip()
    else: 
        # assure a not case sensitive  Match                 --v
        ismatch = row.iloc[0].upper().strip() == row.iloc[1].upper().strip()
    if ismatch: 
        count_matrix_dct['match data'] += 1
        return 'Match'
    else:
        count_matrix_dct['mismatch data'] += 1
        return 'Mismatch'
    raise ValueError("None of the conditions gives a return value") 
#:def
# ---
# Once there are data available as multiline string or pandas dataframe
# you can use following functions ... : 
def record_validation_and_count( src_df, tgt_df ):
    """ Returns DataFrame with Record Validation row and a DataFrame
    with Record Validation counts """
    global pd
    if isinstance(src_df, str): # if data are provided as string
        src_ls = src_df.split('\n')      
        src_df = pd.DataFrame(src_ls[1:],  columns=[src_ls[0]])
    if isinstance(tgt_df, str): 
        tgt_ls = tgt_df.split('\n')      
        tgt_df = pd.DataFrame(tgt_ls[1:],  columns=[tgt_ls[0]])
    if not ( isinstance(src_df, pd.DataFrame) and isinstance(tgt_df, pd.DataFrame) ):
        raise ValueError("Valid data types are: 'str'  and  'pd.DataFrame'")
    src_df.rename(columns = {src_df.columns[0]:'src_'+str(src_df.columns[0])}, inplace = True)
    tgt_df.rename(columns = {tgt_df.columns[0]:'tgt_'+str(tgt_df.columns[0])}, inplace = True)
    # str() above is necessary in case the column name is numerical
    merged_df = pd.concat([src_df, tgt_df], axis=1)
    merged_df['Record Validation'] = merged_df.apply(record_validation, axis=1)
    #         apply function ( record_validation ) to each row      <-  axis=1 
    #         apply function ( record_validation ) to each column   <-  axis=0 
    count_matrix_df = merged_df.groupby(['Record Validation'])['Record Validation'].count()
    return merged_df, count_matrix_df
#:def
def print_results():    
   print(f'{count_matrix_dct=}')  # f-string debug format: prints 'count_matrix_dct={...}'
   print('====================')
   print("count matrix")
   for k,v in count_matrix_dct.items(): print(f'{k:15} {v:5d}')
   print('====================')
   count_matrix_df = merged_df.groupby(['Record Validation'])['Record Validation'].count()
   print(count_matrix_df)
   print('====================')
   print(merged_df)
#:def
# ... to create the merged dataframe with Record Validation column 
merged_df, count_matrix_df = record_validation_and_count( src_df, tgt_df )
# and print the results
print_results()

The code above gives the following output:

count_matrix_dct={'match data': 8, 'mismatch data': 1, 'missing data': 1, 'extra': 1}
====================
count matrix
match data          8
mismatch data       1
missing data        1
extra               1
====================
Record Validation
Extra       1
Match       8
Mismatch    1
Missing     1
Name: Record Validation, dtype: int64
====================
   src_emp name tgt_emp name Record Validation
0         Name1        Name1             Match
1         Name2        NAME2          Mismatch
2         Name3        Name3             Match
3         Name4        Name4             Match
4         Name5        Name5             Match
5         Name6        Name6             Match
6         Name7        Name7             Match
7         Name8         null             Extra
8         Name9        Name9             Match
9        Name10       Name10             Match
10         null       Name11           Missing
