
pandas: sub-dataframe for the different types of values contained in an object dtype column made with mixed types of values?

A dataframe is built with one column containing strings. However, the values held in those strings could be interpreted as other types, such as int, float, boolean, dates, etc. It is as if mixed types of values had been encoded as strings:

import pandas as pd

mixed_values = {'col1': ["123", "3.14", "1010-10-01T01:01:01", "Say Hello'Hello !", "True" ]}

df = pd.DataFrame(data=mixed_values)

df

# it returns:
                  col1
0                  123
1                 3.14
2  1010-10-01T01:01:01
3    Say Hello'Hello !
4                 True


df.dtypes

# it returns:
col1    object
dtype: object

From there, the question is:

  • How to get sub-dataframes by the type of the values contained in the object column (i.e. per int, float, date, string, boolean)?

The objective would be, in the end, to get something like:

df_int
# it should return:
                  col1
0                  123

df_float
# it should return:
                  col1
1                 3.14

df_date
# it should return:
                  col1
2  1010-10-01T01:01:01

df_string
# it should return:
                  col1
3    Say Hello'Hello !

df_boolean
# it should return:
                  col1
4                 True

Update

Trying one solution .../...

The post "Determining the type of value from a string in python" gives some information towards a solution, using a dedicated function to check the type of value contained inside a string.

The post "How to use ast.literal_eval in a pandas dataframe and handle exceptions" gives a solution for creating a new column holding ast.literal_eval results.

Combining both posts, a function can be built and then used to create a new column giving the type of the values contained in the object dtype column.

# sources :
# https://stackoverflow.com/questions/10261141/determine-type-of-value-from-a-string-in-python
# https://stackoverflow.com/questions/52232742/how-to-use-ast-literal-eval-in-a-pandas-dataframe-and-handle-exceptions

import ast

# the function is declared .../...
def gives_data_type_inside_string(val):
    if len(val) == 0:
        return ''
    try:
        t = ast.literal_eval(val)
    except (ValueError, SyntaxError):
        # not a valid Python literal: treat it as a plain string
        return 'STRING'
    # check bool first: True and False also compare equal to 1 and 0
    if type(t) is bool:
        return 'BOOLEAN'
    if type(t) is int:
        return 'INT'
    if type(t) is float:
        return 'FLOAT'
    # any other literal (list, tuple, dict, ...) is kept as a string
    return 'STRING'
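A side note on the ordering of the checks (not from the linked posts): ast.literal_eval("1") returns the int 1, and in Python 1 == True and 0 == False, so a membership test such as t in (True, False) would misclassify the strings "1" and "0" as booleans. An exact type check tells bool and int apart:

```python
import ast

# 1 compares equal to True, so membership in (True, False) also matches ints
print(ast.literal_eval("1") in (True, False))    # → True (misleading)
# an exact type check distinguishes int from bool
print(type(ast.literal_eval("1")) is bool)       # → False
print(type(ast.literal_eval("True")) is bool)    # → True
```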

# .../... then a second function is built
# to apply the above function to a dataframe .../...

def add_col_literal_return(input_df, col_in, col_out):
    # note: the new column is added to input_df in place
    input_df[col_out] = input_df[col_in].apply(gives_data_type_inside_string)
    return input_df
    
# .../... then the above function is applied to the dataframe
df = df.pipe(add_col_literal_return, "col1", "type_of_values")

# .../... and then it returns the dataframe with the new column:

                  col1 type_of_values
0                  123            INT
1                 3.14          FLOAT
2  1010-10-01T01:01:01         STRING
3    Say Hello'Hello !         STRING
4                 True        BOOLEAN


At this stage, a selection per type_of_values gives the targeted sub-dataframes.

df_int = df.query('type_of_values == "INT"')

df_float = df.query('type_of_values == "FLOAT"')

df_string = df.query('type_of_values == "STRING"')

df_boolean = df.query('type_of_values == "BOOLEAN"')

print(df_int)
# returns:

  col1 type_of_values
0  123            INT

print(df_float)
# returns:

   col1 type_of_values
1  3.14          FLOAT

print(df_string)
# returns

                  col1 type_of_values
2  1010-10-01T01:01:01         STRING
3    Say Hello'Hello !         STRING

print(df_boolean)
# returns

   col1 type_of_values
4  True        BOOLEAN


The code above thus seems to deliver a solution, except for date types.

Question

  • In pandas, is there a better way to detect the types of values contained in an object dtype column and to extract a sub-dataframe per type?
  • What should be added to detect date types and get a subset per date?

For Timestamps you can use to_datetime. So first you evaluate for numeric or boolean types, then you try a pandas Timestamp; anything else must be a string.

import ast
import pandas as pd

df = pd.DataFrame({'col1': ["123", "3.14", "2010-10-01T01:01:01", "Say Hello'Hello !", "True" ]})

def evaluate(x):
    types = {int: 'INT', float: 'FLOAT', str: 'STRING',
             bool: 'BOOLEAN', pd.Timestamp: 'TIMESTAMP'}
    try:
        t = type(ast.literal_eval(x))
    except (ValueError, SyntaxError):
        try:
            t = type(pd.to_datetime(x))
        except (ValueError, TypeError):
            t = str
    return types[t]

df['type_of_values'] = df.col1.apply(evaluate)

Result:

                  col1 type_of_values
0                  123            INT
1                 3.14          FLOAT
2  2010-10-01T01:01:01      TIMESTAMP
3    Say Hello'Hello !         STRING
4                 True        BOOLEAN
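Not part of the answer, but as a vectorized sketch (assumptions: only ISO-like dates of the form shown in the sample, and booleans encoded as the literal strings "True"/"False"), pandas' to_numeric and to_datetime with errors='coerce' can classify the column without a Python-level apply:

```python
import pandas as pd

df = pd.DataFrame({'col1': ["123", "3.14", "2010-10-01T01:01:01",
                            "Say Hello'Hello !", "True"]})

# default label; the stricter parses below overwrite it where they succeed
kind = pd.Series('STRING', index=df.index)

# numeric values: everything unparseable becomes NaN
as_num = pd.to_numeric(df['col1'], errors='coerce')
kind[as_num.notna()] = 'FLOAT'
kind[as_num.notna() & (as_num % 1 == 0)] = 'INT'

# dates: match only the ISO-like format used in the sample data (assumption)
as_date = pd.to_datetime(df['col1'], errors='coerce',
                         format='%Y-%m-%dT%H:%M:%S')
kind[as_date.notna()] = 'TIMESTAMP'

# booleans: literal True/False strings (assumption)
kind[df['col1'].isin(['True', 'False'])] = 'BOOLEAN'

df['type_of_values'] = kind
print(df)
```

Restricting to_datetime to an explicit format keeps plain numbers like "123" from being parsed as dates.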

I don't recommend doing so, but if you like you can create your individual dataframes by directly manipulating the locals or globals dictionaries:

for name, group in df.groupby('type_of_values'):
    locals()[f'df_{name}'] = group
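A safer pattern (a common alternative, not from the answer above) is to collect the groups in a dict keyed by the type label, so no variable injection is needed:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ["123", "3.14", "Say Hello'Hello !", "True"],
    'type_of_values': ['INT', 'FLOAT', 'STRING', 'BOOLEAN'],
})

# one sub-dataframe per detected type, keyed by the label
sub_dfs = {name: group for name, group in df.groupby('type_of_values')}

print(sub_dfs['INT'])
```

Look-ups then stay explicit, e.g. sub_dfs['FLOAT'], and new type labels need no extra code.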
