简体   繁体   中英

In Python, run each row in a csv through tests and output a new csv showing which test each row failed

In python, I would like to run a csv through test cases checking for data anomalies while keep track of each test it fails.

This is my first big project in Python, though I have Python experience and can do basic one-liners using pandas like df.drop_duplicates(subset=['UniqueID']) , I just am not sure what the right direction would be.

  • Data Example:
MnLast MnFist MnDead? MnInactive? SpLast SpFirst SPInactive? SpDead? Addee Sal
Doe John No No Doe Jane No No Mr. John Doe Mr. John

Main(Mn) record isn't blank, Spouse(Sp) record isn't blank, neither record is marked deceased but Addee or Sal doesn't have '&' or 'and'. This indicates the Addressee(Addee) or Salutation(Sal) is incorrect, as Addressee or Salutation should have a variation of: " Mr. and Mrs. John doe "

  • Pseudo code:
Read csv

for each row in csv

  #test case 1
  if [ {( (MNFirst AND MnLast) =! BLANK ) AND ( (SpLast AND SpFirst) =! BLANK )} AND 
  (( SpDead? AND MnDead?) =! Yes)] AND [(Addee OR Sal) Does not contains ('&' or 'and')] 
  
     output failing row to new csv tracking what case it failed

  else 

      nothing
  • My goal

Read a csv file, run the file through several test cases(there are several). Then output a new csv, with a new column indicating each case it failed. So if my Data Example failed 3 different cases, the new column would show a number corresponding to the case it failed. The csv output would show the following:

CaseFailed MnLast MnFist MnDead? MnInactive? SpLast SpFirst SPInactive? SpDead? Addee Sal
1, 5, 8 Doe john No No Doe Jane No No Mr. John Doe Mr. John

Any help to point me in the right direction would be greatly appreciated.

import pandas as pd 
import numpy as np

data = pd.read_csv(csv_file, encoding='latin-1')

# Create array to track failed cases.
data['Failed Test']= ''
    data = data.replace(np.nan,'')
    data.insert(0, 'ID', range(0, len(data)))

# Test 1: The spouse shows a deceased date, but martial status is not marked as widowed
  testcase1 = data[((data['SRDeceasedDate'] != '') & (data['MrtlStat'] != 'Widowed'))]
    ids = testcase1.index.tolist()
    for i in ids:
      data.at[i,'Failed Test']+=', 1'

# Test 2: Spouse name information is filled in but marital status shows single. 
 df = data[((data['SRLastName'] != '') | (data['SRFirstName'] != ''))]
    testcase2 = df[df['MrtlStat'] == 'single']
    ids = testcase2.index.tolist()
    for i in ids:
      data.at[i,'Failed Test']+=', 2'

# sort and separate  which rows have failed a test
failed = data[(data['Failed test'] != '')]
passed = data[(data['Failed test'] == '')]
failed['Failed Test'] = failed['Failed Test'].str[1:]
failed = failed[(failed['Failed Test'] != '')]

# Clean up
del failed["ID"]
del passed["ID"]

# Print results 
failed['Test Case Failed'].value_counts()
print("There was a total of",data.shape[0], "rows.", "There were" ,data.shape[0] - failed.shape[0], "rows passed and" ,failed.shape[0], "rows failed at least one test case")

# output failed rows
failed.to_csv("C:/Users/Output/failed.csv", index=False,) 

# output passed rows
passed.to_csv("C:/Users/Output/passed.csv", index=False,) 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM