How to differentiate string and Alphanumeric?

df:

  company_name   product    Id     rating
0   matrix       mobile    Id456     2.5
1   ins-faq      alpha1    Id956     3.5
2   metric5      sounds-B  Id-356    2.5
3   ingsaf       digital   Id856     4star
4   matrix       win11p    Idklm     2.0
5   4567         mobile    596       3.5

df2:

  Col_name       Datatype
0 company_name   String        #(pure string)
1 Product        String        #(pure string)
2 Id             Alpha-Numeric #(must contain at least 1 number and 1 letter)
3 rating         Float or int

df is the main dataframe, and df2 lists the expected datatype of each of its columns.

How can I check the values in every column and extract those that have the wrong datatype?

Expected output:

  row_num   col_name      current_value  expected_dtype
0    2      company_name  metric5        string
1    5      company_name  4567           string
2    1      Product       alpha1         string
3    4      Product       win11p         string
4    4      Id            Idklm          Alpha-Numeric
5    5      Id            596            Alpha-Numeric
6    3      rating        4star          Float or int

For columns that cannot contain numbers, you can find the exceptions with:

In [5]: df['product'].str.contains(r'[0-9]')
Out[5]: 
0    False
1     True
2    False
3    False
4     True
5    False
Name: product, dtype: bool
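The boolean mask above only says which rows fail. A minimal sketch of turning it into the offending values themselves, assuming the column holds strings and using a small stand-in for the question's `product` column (the name `bad_products` is only illustrative):

```python
import pandas as pd

# Stand-in data copied from the question's product column.
df = pd.DataFrame({
    "product": ["mobile", "alpha1", "sounds-B", "digital", "win11p", "mobile"],
})

# True where the value contains at least one digit, i.e. is not a "pure string".
has_digit = df["product"].str.contains(r"[0-9]")

# Boolean indexing keeps only the offending rows, along with their original index.
bad_products = df.loc[has_digit, "product"]
print(bad_products)
```

The index of `bad_products` carries the original row numbers, which is what the `row_num` column of the expected output needs.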

For Alpha-Numeric columns, identify compliant values with:

In [7]: df['Id'].str.contains(r'(?:\d\D)|(?:\D\d)')
Out[7]: 
0     True
1     True
2     True
3     True
4    False
5    False
Name: Id, dtype: bool
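One caveat: in that pattern `\D` matches any non-digit character, not only letters, so a value such as `-3` would also pass. If the rule really is "at least one letter and one digit", a stricter check could use two lookaheads. The following is only a sketch; the helper name `is_alphanumeric` is hypothetical, and it assumes "letter" means an ASCII letter:

```python
import re

# Two lookaheads anchored at the start of the string: the first requires a
# letter somewhere in the value, the second requires a digit somewhere.
ALNUM = re.compile(r"(?=.*[A-Za-z])(?=.*\d)")

def is_alphanumeric(value):
    """Return True if value has at least one ASCII letter and one digit."""
    return bool(ALNUM.match(str(value)))

# The question's Id values plus "-3", which has a digit but no letter.
ids = ["Id456", "Id956", "Id-356", "Id856", "Idklm", "596", "-3"]
results = {v: is_alphanumeric(v) for v in ids}
print(results)
```

The same function can be applied to a column with `df['Id'].apply(is_alphanumeric)`.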

For int or float columns, find exceptions with:

In [8]: df['rating'].str.contains(r'[^0-9.+-]')
Out[8]: 
0    False
1    False
2    False
3     True
4    False
5    False
Name: rating, dtype: bool

That pattern may be problematic: it won't catch values with multiple or misplaced plus, minus, or dot characters, such as 9.4.1 or 6+3.-12. Instead, you could use:

In [11]: def check(thing):
    ...:     try:
    ...:         float(thing)  # raises ValueError if thing is not a valid number
    ...:         return True   # also True for zero values such as "0" or "0.0"
    ...:     except ValueError:
    ...:         return False
    ...:

In [12]: df['rating'].apply(check)
Out[12]: 
0     True
1     True
2     True
3    False
4     True
5     True
Name: rating, dtype: bool
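The three checks can be combined into the report the question asks for. The code below is a sketch under several assumptions, not a definitive implementation: every column is assumed to hold strings, `df2` is assumed to spell the column names exactly as `df` does (so `product` rather than `Product`), and the names `validators`, `is_number` and `report` are made up for illustration.

```python
import pandas as pd

# Data from the question; all values kept as strings (an assumption).
df = pd.DataFrame({
    "company_name": ["matrix", "ins-faq", "metric5", "ingsaf", "matrix", "4567"],
    "product": ["mobile", "alpha1", "sounds-B", "digital", "win11p", "mobile"],
    "Id": ["Id456", "Id956", "Id-356", "Id856", "Idklm", "596"],
    "rating": ["2.5", "3.5", "2.5", "4star", "2.0", "3.5"],
})
df2 = pd.DataFrame({
    "Col_name": ["company_name", "product", "Id", "rating"],
    "Datatype": ["String", "String", "Alpha-Numeric", "Float or int"],
})

def is_number(value):
    """Return True if value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

# Each expected datatype maps to a function returning a mask of VALID values.
validators = {
    "String": lambda s: ~s.str.contains(r"[0-9]"),
    "Alpha-Numeric": lambda s: s.str.contains(r"(?:\d\D)|(?:\D\d)"),
    "Float or int": lambda s: s.apply(is_number).astype(bool),
}

rows = []
for col, expected in zip(df2["Col_name"], df2["Datatype"]):
    valid = validators[expected](df[col])
    # Collect every value that fails the check, with its original row number.
    for row_num, value in df.loc[~valid, col].items():
        rows.append({"row_num": row_num, "col_name": col,
                     "current_value": value, "expected_dtype": expected})

report = pd.DataFrame(
    rows, columns=["row_num", "col_name", "current_value", "expected_dtype"]
)
print(report)
```

Keeping the validators in a dictionary keyed by datatype means a new datatype only needs one more entry, and the loop itself stays unchanged.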
