
Pandas Dataframe column Validation

I have a pandas DataFrame with 30 columns and 4000 rows.

For about 5 of the columns I need to check that the values pass data validation. Is there a way to say something like "if df.Gender contains any value that's not 'M' or 'F', then print error"?

Or: if df.MaritalStatus contains a value that's not M, S, or D, then print error.

sample of df

Does anyone know the best way of applying these conditions?

import pandas as pd

df = pd.read_csv("C:/Users/ABV1234/Desktop/DailyReport.csv")

##if df.Gender contains value that is not in ['m', 'f'] print Error

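You can check whether every value in df.Gender is one of ['M', 'F'], and print an error if any value is not:

# isin() marks each row True when its value is allowed; all() is False if any row fails
if not df.Gender.isin(['M', 'F']).all():
    print("Error")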

Checking the first condition:

if df.Gender contains any value that's not 'M' or 'F', then print error

gender_series = df.Gender.values

for x in gender_series:
    if x not in ('M', 'F'):
        print("error")

Checking the second condition:

if df.MaritalStatus contains a value that's not M, S, or D, then print error.

maritalstatus_series = df.MaritalStatus.values

for x in maritalstatus_series:
    if x not in ('M', 'S', 'D'):
        print("error")

Thanks
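The two loops above print "error" once per offending row but do not say which row failed. A minimal sketch of one way to report the row labels instead, assuming a small made-up frame (`sample`) in place of the real report; the helper name `invalid_row_labels` and the `rules` dictionary are hypothetical, not part of the original answer:

```python
import pandas as pd


def invalid_row_labels(df, column, allowed):
    """Return the index labels of rows whose value in `column` is not in `allowed`."""
    # isin() builds a boolean mask of allowed rows; ~ inverts it so True marks bad rows.
    bad_mask = ~df[column].isin(list(allowed))
    return df.index[bad_mask].tolist()


# Small made-up frame standing in for the real report.
sample = pd.DataFrame({
    "Gender": ["M", "F", "X", "M"],
    "MaritalStatus": ["M", "S", "D", "W"],
})

# Allowed values per column, taken from the two conditions in the question.
rules = {"Gender": ("M", "F"), "MaritalStatus": ("M", "S", "D")}

for column, allowed in rules.items():
    bad = invalid_row_labels(sample, column, allowed)
    if bad:
        print(f"error: {column} has invalid values at rows {bad}")
```

Because the check is vectorized, it prints at most one line per column instead of one per bad row, which may be easier to read on a 4000-row file.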

A possible improvement on the above answers would be to collect and report all the failure cases after evaluating the entire column.

This returns a filtered DataFrame containing every row where the Gender column is neither 'M' nor 'F'.

import pandas as pd
df = pd.DataFrame({"MaritalStatus":["M","S","F"],"Gender":["M","S","F"]})
df.loc[~df.loc[:,"Gender"].isin(['M','F']),:]
>>>   MaritalStatus Gender
    1             S      S

The same can be done for marital status:

df.loc[~df.loc[:,"MaritalStatus"].isin(['M','S','D']),:]
>>>  MaritalStatus Gender
    2             F      F

If you're spot-checking the data for unexpected values, you can then get the values that fail these conditions:

expected_values = {"MaritalStatus":['M','S','D'],"Gender":['M','F']}
for feature in expected_values:
    print(f"The following unexpected values were found in {feature} column:",
          set(df.loc[~df.loc[:,feature].isin(expected_values[feature]),:][feature]))
>>> The following unexpected values were found in MaritalStatus column: {'F'}
>>> The following unexpected values were found in Gender column: {'S'}

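When spot-checking a larger file it can also help to know how often each unexpected value occurs. The sketch below reuses the `expected_values` dictionary idea from above; the function name `count_unexpected` and the `demo` frame are made up for illustration:

```python
import pandas as pd


def count_unexpected(df, expected_values):
    """Map each failing column to {unexpected value: number of occurrences}."""
    report = {}
    for column, allowed in expected_values.items():
        # Keep only the values of this column that are not in the allowed list.
        bad = df.loc[~df[column].isin(allowed), column]
        if not bad.empty:
            # dropna=False keeps missing values visible in the report.
            report[column] = bad.value_counts(dropna=False).to_dict()
    return report


# Made-up data: 'F' is not a valid MaritalStatus, 'S' is not a valid Gender.
demo = pd.DataFrame({
    "MaritalStatus": ["M", "S", "F", "F"],
    "Gender": ["M", "S", "F", "M"],
})
expected_values = {"MaritalStatus": ["M", "S", "D"], "Gender": ["M", "F"]}

report = count_unexpected(demo, expected_values)
print(report)
```

Columns with no unexpected values are left out of the result, so an empty dictionary means every checked column passed.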
Alternatively, you can use the pandera library, which lets you declare expectations for your dataset and validate the data against them. Validating with lazy=True collects all the failure cases and reports them at once, instead of raising an error at the first failure.

import pandera as pa

schema = pa.DataFrameSchema(
    {
        "MaritalStatus": pa.Column(pa.String, checks=pa.Check.isin(["M","S","D"])),
        "Gender": pa.Column(pa.String, checks=pa.Check.isin(["M","F"])),
    },
    strict=False,
)
schema.validate(df,lazy=True)

>>> 
  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Daten\venv\lib\site-packages\pandera\schemas.py", line 592, in validate
    error_handler.collected_errors, check_obj
pandera.errors.SchemaErrors: A total of 2 schema errors were found.

Error Counts
------------
- schema_component_check: 2

Schema Error Summary
--------------------
                                                   failure_cases  n_failure_cases
schema_context column        check
Column         Gender        isin({'F', 'M'})                [S]                1
               MaritalStatus isin({'M', 'D', 'S'})           [F]                1
