
How to find inappropriate datatype values in each column of a pandas DataFrame?

main_df:

    Name    Age   Id     DOB
0   Tom     20   A4565  22-07-1993
1   nick    21   G4562  11-09-1996
2   krish   AKL  F4561  15-03-1997
3   636A    18   L5624  06-07-1995
4   mak     20   K5465  03-09-1997
5   nits    55   56541  45aBc
6   444     66   NIT    09031992

column_info_df:

   Column_Name  Column_Type
0   Name         string
1   Age          integer
2   Id           string
3   DOB          Date

How can I find the datatype-error values in main_df? For example, column_info_df says 'Name' is a string column, so in main_df the 'Name' column should contain only string or alphanumeric values; anything else is an error. I need to collect those datatype-error values in a separate dataframe.

error output df:

   Column_Name  Current_Value  Exp_Dtype  Index_No.
0  Name         444            string     6
1  Age          AKL            integer    2
2  Id           56541          string     5
3  DOB          45aBc          Date       5
4  DOB          09031992       Date       6
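
For reference, the two frames can be built like this (all values kept as strings, since the columns contain mixed entries):

import pandas as pd

main_df = pd.DataFrame({
    'Name': ['Tom', 'nick', 'krish', '636A', 'mak', 'nits', '444'],
    'Age':  ['20', '21', 'AKL', '18', '20', '55', '66'],
    'Id':   ['A4565', 'G4562', 'F4561', 'L5624', 'K5465', '56541', 'NIT'],
    'DOB':  ['22-07-1993', '11-09-1996', '15-03-1997', '06-07-1995',
             '03-09-1997', '45aBc', '09031992'],
})

column_info_df = pd.DataFrame({
    'Column_Name': ['Name', 'Age', 'Id', 'DOB'],
    'Column_Type': ['string', 'integer', 'string', 'Date'],
})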

I tried this:

for i, r in column_info_df.iterrows():
    if r['Column_Type'] == 'string':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^a-zA-Z]+')]
    elif r['Column_Type'] == 'integer':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^0-9]+')]
    elif r['Column_Type'] == 'Date':
        pass  # stuck here

I'm stuck here, because this regex isn't catching every error, and I don't know how to go further.

If I understood what you did, you created a separate dataframe that contains info about your main one.

Instead, I would suggest using the built-in methods pandas offers for dealing with dataframes.

For instance, if you have a dataframe main, then:

main.info()

will give you the dtype of each column. Note that a column has a single dtype, as it is a Series, which is itself backed by an ndarray.

So your Name column cannot contain stray non-string values that you might have missed; it can, however, contain NaN values. You can check for those with the help of

main.describe()
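
For instance, on the main_df above (a minimal illustration, assuming main_df holds the raw strings shown, so every column comes out as object):

main_df.info()      # per-column dtype and non-null count; all four columns
                    # report dtype 'object' because of mixed entries like 'AKL'
main_df.describe()  # for object columns: count, unique, top and freq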

I hope that helped :-)

Here is one way, using df.eval().

Note: this checks each value against a pattern and returns the non-matching ones. However, it cannot check validity; for example, if the date column has an entry that looks like a date but is an invalid date, this will not identify it:

d={"string":".str.contains(r'[a-z|A-Z]')","integer":".str.contains('^[0-9]*$')",
                                 "Date":".str.contains('\d\d-\d\d-\d\d\d\d')"}
m=df.eval([f"~{a}{b}" 
   for a,b in zip(column_info_df['Column_Name'],column_info_df['Column_Type'].map(d))]).T

final=(pd.DataFrame(np.where(m,df,np.nan),columns=df.columns)
              .reset_index().melt('index',var_name='Column_Name',
                            value_name='Current_Value').dropna())
final['Expected_dtype']=(final['Column_Name']
                         .map(column_info_df.set_index('Column_Name')['Column_Type']))
print(final)

Output:

    index Column_Name Current_Value Expected_dtype
6       6        Name           444         string
9       2         Age           AKL        integer
19      5          Id         56541         string
26      5         DOB         45aBc           Date
27      6         DOB      09031992           Date

I agree there can be better regex patterns for this job, but the idea should be the same.
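
To also catch well-formed but impossible dates, which the pattern above would accept (e.g. 99-99-9999), one option is to let pd.to_datetime do the parsing. A minimal sketch, assuming the dd-mm-yyyy format from the sample:

# Values that fail to parse as real dd-mm-yyyy dates become NaT.
parsed = pd.to_datetime(main_df['DOB'], format='%d-%m-%Y', errors='coerce')
bad_dates = main_df.loc[parsed.isna(), 'DOB']  # flags 45aBc and 09031992 here
print(bad_dates)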
