In Python, run each row in a csv through tests and output a new csv showing which test each row failed

Question

In python, I would like to run a csv through test cases checking for data anomalies while keep track of each test it fails.

This is my first big project in Python, though I have Python experience and can do basic one-liners using pandas like df.drop_duplicates(subset=['UniqueID']) , I just am not sure what the right direction would be.

Data Example:

MnLast	MnFist	MnDead?	MnInactive?	SpLast	SpFirst	SPInactive?	SpDead?	Addee	Sal
Doe	John	No	No	Doe	Jane	No	No	Mr. John Doe	Mr. John

Main(Mn) record isn't blank, Spouse(Sp) record isn't blank, neither record is marked deceased but Addee or Sal doesn't have '&' or 'and'. This indicates the Addressee(Addee) or Salutation(Sal) is incorrect, as Addressee or Salutation should have a variation of: " Mr. and Mrs. John doe "

Pseudo code:

Read csv

for each row in csv

  #test case 1
  if [ {( (MNFirst AND MnLast) =! BLANK ) AND ( (SpLast AND SpFirst) =! BLANK )} AND 
  (( SpDead? AND MnDead?) =! Yes)] AND [(Addee OR Sal) Does not contains ('&' or 'and')] 
  
     output failing row to new csv tracking what case it failed

  else 

      nothing

My goal

Read a csv file, run the file through several test cases(there are several). Then output a new csv, with a new column indicating each case it failed. So if my Data Example failed 3 different cases, the new column would show a number corresponding to the case it failed. The csv output would show the following:

CaseFailed	MnLast	MnFist	MnDead?	MnInactive?	SpLast	SpFirst	SPInactive?	SpDead?	Addee	Sal
1, 5, 8	Doe	john	No	No	Doe	Jane	No	No	Mr. John Doe	Mr. John

Any help to point me in the right direction would be greatly appreciated.

Answer 1

import pandas as pd 
import numpy as np

data = pd.read_csv(csv_file, encoding='latin-1')

# Create array to track failed cases.
data['Failed Test']= ''
    data = data.replace(np.nan,'')
    data.insert(0, 'ID', range(0, len(data)))

# Test 1: The spouse shows a deceased date, but martial status is not marked as widowed
  testcase1 = data[((data['SRDeceasedDate'] != '') & (data['MrtlStat'] != 'Widowed'))]
    ids = testcase1.index.tolist()
    for i in ids:
      data.at[i,'Failed Test']+=', 1'

# Test 2: Spouse name information is filled in but marital status shows single. 
 df = data[((data['SRLastName'] != '') | (data['SRFirstName'] != ''))]
    testcase2 = df[df['MrtlStat'] == 'single']
    ids = testcase2.index.tolist()
    for i in ids:
      data.at[i,'Failed Test']+=', 2'

# sort and separate  which rows have failed a test
failed = data[(data['Failed test'] != '')]
passed = data[(data['Failed test'] == '')]
failed['Failed Test'] = failed['Failed Test'].str[1:]
failed = failed[(failed['Failed Test'] != '')]

# Clean up
del failed["ID"]
del passed["ID"]

# Print results 
failed['Test Case Failed'].value_counts()
print("There was a total of",data.shape[0], "rows.", "There were" ,data.shape[0] - failed.shape[0], "rows passed and" ,failed.shape[0], "rows failed at least one test case")

# output failed rows
failed.to_csv("C:/Users/Output/failed.csv", index=False,) 

# output passed rows
passed.to_csv("C:/Users/Output/passed.csv", index=False,)

In Python, run each row in a csv through tests and output a new csv showing which test each row failed

Question

1 answers

solution1
0 ACCPTED 2022-08-03 18:35:57

In Python, run each row in a csv through tests and output a new csv showing which test each row failed

Question

1 answers

solution1 0 ACCPTED 2022-08-03 18:35:57

solution1
0 ACCPTED 2022-08-03 18:35:57