I have a pandas DataFrame with 30 columns and 4000 rows.
For about 5 columns I need to validate the data. Is there a way to say something like "if df.Gender contains any value that's not 'M' or 'F', then print an error",
or "if df.MaritalStatus contains a value that's not M, S, or D, then print an error"?
Does anyone know the best way of applying these conditions?
import pandas as pd
df = pd.read_csv("C:/Users/ABV1234/Desktop/DailyReport.csv")
# if df.Gender contains a value that is not in ['M', 'F'], print "Error"
You can check whether any value in df.Gender is not one of ['M', 'F']:
# any() is True as soon as one value falls outside the allowed set
if any(x not in ('M', 'F') for x in df.Gender.values):
    print("Error")
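A vectorized way to check that every value in the column is allowed, as a minimal sketch: `Series.isin` returns one boolean per row and `.all()` reduces them to a single answer. The sample DataFrame and the names `allowed` and `gender_ok` are assumptions made for illustration; they stand in for the CSV from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV from the question.
df = pd.DataFrame({"Gender": ["M", "F", "X", "M"]})

allowed = ["M", "F"]

# isin() yields True/False per row; all() is True only if every row passes.
gender_ok = df["Gender"].isin(allowed).all()

if not gender_ok:
    print("Error")
```

This prints a single message per column regardless of how many rows fail, which is usually easier to read on a 4000-row file.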
Checking the first condition:
if df.Gender contains any value that's not 'M' or 'F', then print error
gender_series = df.Gender.values
# prints "error" once for every row that fails the check
for x in gender_series:
    if x not in ('M', 'F'):
        print("error")
Checking the second condition:
if df.MaritalStatus contains a value that's not M, S, or D, then print error.
maritalstatus_series = df.MaritalStatus.values
for x in maritalstatus_series:
    if x not in ('M', 'S', 'D'):
        print("error")
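Since a bare "error" does not say which row failed, here is a rough sketch of a variant that reports the row label and the offending value. The helper name `report_invalid` and the sample data are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data; in practice df would come from pd.read_csv(...).
df = pd.DataFrame({
    "Gender": ["M", "F", "X"],
    "MaritalStatus": ["M", "S", "W"],
})

def report_invalid(frame, column, allowed):
    """Print every invalid value with its row label; return the number found."""
    # Keep only the rows whose value is not in the allowed collection.
    bad = frame.loc[~frame[column].isin(list(allowed)), column]
    for idx, value in bad.items():
        print(f"error: {column}[{idx}] = {value!r}")
    return len(bad)

n_gender = report_invalid(df, "Gender", ("M", "F"))
n_marital = report_invalid(df, "MaritalStatus", ("M", "S", "D"))
```

The return value makes it easy to fail a script when any count is non-zero.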
Thanks
A possible improvement on the above answers would be to collect and report all the failure cases after evaluating the entire column.
The following returns a filtered DataFrame of all rows where the Gender column is neither 'M' nor 'F'.
import pandas as pd
df = pd.DataFrame({"MaritalStatus": ["M", "S", "F"], "Gender": ["M", "S", "F"]})
df.loc[~df.loc[:, "Gender"].isin(['M', 'F']), :]
>>>
  MaritalStatus Gender
1             S      S
The same can be done for marital status:
df.loc[~df.loc[:, "MaritalStatus"].isin(['M', 'S', 'D']), :]
>>>
  MaritalStatus Gender
2             F      F
If you're spot-checking the data for unexpected values, you can then get the values that fail these conditions:
expected_values = {"MaritalStatus": ['M', 'S', 'D'], "Gender": ['M', 'F']}
for feature in expected_values:
    print(f"The following unexpected values were found in {feature} column:",
          set(df.loc[~df.loc[:, feature].isin(expected_values[feature]), :][feature]))
>>> The following unexpected values were found in MaritalStatus column: {'F'}
>>> The following unexpected values were found in Gender column: {'S'}
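If it also helps to know how often each unexpected value occurs, the same mask can feed `value_counts`. This is only a sketch: the function name `unexpected_counts` and the extra sample rows (including a missing value) are assumptions added for illustration. Note that `isin` treats a missing value as not being in the list, so missing entries are reported as failures too.

```python
import pandas as pd

# Hypothetical sample data with a repeated bad value and one missing value.
df = pd.DataFrame({
    "MaritalStatus": ["M", "S", "F", None, "F"],
    "Gender": ["M", "S", "F", "F", "M"],
})
expected_values = {"MaritalStatus": ["M", "S", "D"], "Gender": ["M", "F"]}

def unexpected_counts(frame, expected):
    """Map each column name to a dict of {unexpected value: occurrence count}."""
    report = {}
    for feature, allowed in expected.items():
        # Rows whose value is not allowed (missing values land here as well).
        bad = frame.loc[~frame[feature].isin(allowed), feature]
        # dropna=False keeps missing values in the tally.
        report[feature] = bad.value_counts(dropna=False).to_dict()
    return report

report = unexpected_counts(df, expected_values)
for feature, counts in report.items():
    print(feature, counts)
```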
Alternatively, you can use the pandera library, which lets you declare expectations for your dataset and validate it against them. Passing lazy=True to validate collects all the failure cases and reports them at once, instead of raising at the first failure.
import pandera as pa
schema = pa.DataFrameSchema(
    {
        "MaritalStatus": pa.Column(pa.String, checks=pa.Check.isin(["M", "S", "D"])),
        "Gender": pa.Column(pa.String, checks=pa.Check.isin(["M", "F"])),
    },
    strict=False,
)
schema.validate(df, lazy=True)
>>>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Daten\venv\lib\site-packages\pandera\schemas.py", line 592, in validate
    error_handler.collected_errors, check_obj
pandera.errors.SchemaErrors: A total of 2 schema errors were found.

Error Counts
------------
- schema_component_check: 2

Schema Error Summary
--------------------
                                                   failure_cases  n_failure_cases
schema_context  column         check
Column          Gender         isin({'F', 'M'})              [S]                1
                MaritalStatus  isin({'M', 'D', 'S'})         [F]                1