简体   繁体   中英

How to validate csv data read by pandas.read_csv?

Looking to validate the data in a csv file using Python and Pandas. Everything works fine when feeding clean data. However when there is a problem with the data, the issues are hard to find. Raising any sort of error would be great. Here is some pseudo code:

dtypes = {'Date': 'str', 'yesno': 'str', 'int_val': 'int', 'decimal_value': 'str'}
df = pd.read_csv(filename, dtype=dtypes)

# Ensure exceptions are thrown for invalid data.

# valid date format in date. ValueError raised for invalid data.
pd.to_datetime(df['Date'])

# 'yes' or 'no' for the yesno field. (has to be in a list of values)
# valid integer for int_val.
# a valid integer or decimal number for decimal_value

I am not even sure the pd.to_datetime is the best way to validate a date. What is a good way of doing this?

'yes' or 'no' for the yesno field. (has to be in a list of values):

df.yesno.isin(['yes','no']).all() # Returns False if there are any other values

valid integer for int_val:

df.int_val.astype(int) # Throws an error if there are non-integers
# or, if int_val are floats:
import numpy as np
np.isclose(df.int_val.round(0),df.int_val.astype(int)).all()

a valid integer or decimal number for decimal_value:

df.decimal_value.astype(float) # similar to above

Using pd.to_datetime() to validate dates is probably best; you can also specify the format of the dates if necessary, eg with the keyword argument format = '%y-%m-%d' which expects dates to be of the form yyyy-mm-dd .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM