A dataframe is built with one column of strings. However, the values held in those strings could be interpreted as other types, such as int, float, boolean, or date. In other words, values of mixed types have all been encoded as strings:
import pandas as pd
mixed_values = {'col1': ["123", "3.14", "1010-10-01T01:01:01", "Say Hello'Hello !", "True" ]}
df = pd.DataFrame(data=mixed_values)
df
# it returns :
col1
0 123
1 3.14
2 1010-10-01T01:01:01
3 Say Hello'Hello !
4 True
df.dtypes
# it returns :
col1 object
dtype: object
From there, the objective is to end up with something like:
df_int
# it should return:
col1
0 123
df_float
# it should return:
col1
1 3.14
df_date
# it should return:
col1
2 1010-10-01T01:01:01
df_string
# it should return:
col1
3 Say Hello'Hello !
df_boolean
# it should return:
col1
4 True
The post "Determining the type of value from a string in python" gives some information toward a solution, using a dedicated function to check the type of the value contained in a string.
The post "How to use ast.literal_eval in a pandas dataframe and handle exceptions" gives a solution that creates a new column holding the results of ast.literal_eval.
Combining both posts, a function can be built and then used to create a new column giving the type of each value in the object-dtype column.
# sources :
# https://stackoverflow.com/questions/10261141/determine-type-of-value-from-a-string-in-python
# https://stackoverflow.com/questions/52232742/how-to-use-ast-literal-eval-in-a-pandas-dataframe-and-handle-exceptions
import ast
# the function is declared .../...
def gives_data_type_inside_string(val):
    if len(val) == 0:
        return ''
    try:
        t = ast.literal_eval(val)
    except (ValueError, SyntaxError):
        # not a Python literal (free text, dates, ...)
        return 'STRING'
    # check bool first: bool is a subclass of int, and 1 == True
    if isinstance(t, bool):
        return 'BOOLEAN'
    if type(t) is int:
        return 'INT'
    if type(t) is float:
        return 'FLOAT'
    return 'STRING'
# .../... then a second function is built
# to apply the above function to a dataframe .../...
def add_col_literal_return(input_df, col_in, col_out):
    # work on a copy to avoid overwriting the original dataframe
    output_df = input_df.copy()
    output_df[col_out] = output_df[col_in].apply(gives_data_type_inside_string)
    return output_df
# .../... then the above function is applied to the dataframe
df = df.pipe(add_col_literal_return, "col1", "type_of_values")
df
# .../... and then it returns the dataframe with the new column:
col1 type_of_values
0 123 INT
1 3.14 FLOAT
2 1010-10-01T01:01:01 STRING
3 Say Hello'Hello ! STRING
4 True BOOLEAN
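One point worth keeping in mind when classifying evaluated values: in Python, bool is a subclass of int, so 1 == True and 0 == False. A membership test such as `t in {True, False}` therefore also matches the numbers 1 and 0, whereas `isinstance(t, bool)` does not. A small sketch of the difference (the helper name `is_real_bool` is hypothetical and only used for illustration):

```python
import ast

def is_real_bool(value):
    # isinstance distinguishes True/False from the integers 1/0
    return isinstance(value, bool)

one = ast.literal_eval("1")      # the integer 1
true = ast.literal_eval("True")  # the boolean True

print(one in {True, False})  # True, because 1 == True
print(is_real_bool(one))     # False
print(is_real_bool(true))    # True
```

With a value-based check, a string such as "1" or "0" would be labelled BOOLEAN rather than INT.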
At this stage, a selection per type_of_values could give the targeted sub-dataframes.
df_int = df.query('type_of_values == "INT"')
df_float = df.query('type_of_values == "FLOAT"')
df_string = df.query('type_of_values == "STRING"')
df_boolean = df.query('type_of_values == "BOOLEAN"')
print(df_int)
# returns :
col1 type_of_values
0 123 INT
print(df_float)
#returns
col1 type_of_values
1 3.14 FLOAT
print(df_string)
# returns
col1 type_of_values
2 1010-10-01T01:01:01 STRING
3 Say Hello'Hello ! STRING
print(df_boolean)
# returns
col1 type_of_values
4 True BOOLEAN
The code above therefore seems to provide a way to reach the solution, except for date types.
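The reason dates end up as STRING is that ast.literal_eval only understands Python literals, and an ISO date string is not one, so the call raises an exception (a SyntaxError or a ValueError, depending on the string and the Python version). A minimal sketch to observe this, where the helper name `literal_eval_outcome` is hypothetical:

```python
import ast

def literal_eval_outcome(text):
    """Return the evaluated type name, or the exception name if parsing fails."""
    try:
        return type(ast.literal_eval(text)).__name__
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__

for text in ["123", "3.14", "True", "1010-10-01T01:01:01"]:
    print(repr(text), "->", literal_eval_outcome(text))
```

The first three strings evaluate to int, float and bool; the date string only produces an exception name, which is why a separate date check is needed.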
For Timestamps you can use to_datetime. So first you check for numeric or boolean types, then you try a pandas Timestamp; anything else must be a string.
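import ast
import pandas as pd
df = pd.DataFrame({'col1': ["123", "3.14", "2010-10-01T01:01:01", "Say Hello'Hello !", "True"]})
def evaluate(x):
    types = {int: 'INT', float: 'FLOAT', str: 'STRING', bool: 'BOOLEAN', pd.Timestamp: 'TIMESTAMP'}
    try:
        # numeric and boolean literals are recognised here
        t = type(ast.literal_eval(x))
    except Exception:
        try:
            # not a Python literal: try to parse it as a timestamp
            t = type(pd.to_datetime(x))
        except Exception:
            t = str
    # any other type (list, tuple, NaT, ...) is treated as a string
    return types.get(t, 'STRING')
df['type_of_values'] = df.col1.apply(evaluate)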
Result:
col1 type_of_values
0 123 INT
1 3.14 FLOAT
2 2010-10-01T01:01:01 TIMESTAMP
3 Say Hello'Hello ! STRING
4 True BOOLEAN
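I don't recommend doing so, but if you like, you can create your individual dataframes by directly manipulating the locals or globals dictionaries:
# creates df_INT, df_FLOAT, ... at module level
# (writing to locals() inside a function is not reliable)
for label, group in df.groupby('type_of_values'):
    locals()[f'df_{label}'] = group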
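A variant that avoids creating variables dynamically is to keep the groups in a dictionary keyed by the type label. This is only a sketch: the name `frames` is hypothetical, and the `type_of_values` column is hard-coded here so that the example runs on its own.

```python
import pandas as pd

# Labels are written out by hand so the example is self-contained;
# in practice they would come from the classification step.
df = pd.DataFrame({
    "col1": ["123", "3.14", "2010-10-01T01:01:01", "Say Hello'Hello !", "True"],
    "type_of_values": ["INT", "FLOAT", "TIMESTAMP", "STRING", "BOOLEAN"],
})

# One sub-dataframe per label, looked up by key instead of by variable name
frames = {label: group for label, group in df.groupby("type_of_values")}

df_int = frames["INT"]
print(df_int)

# The values are still strings; cast them once the rows are isolated
int_values = frames["INT"]["col1"].astype(int)
float_values = frames["FLOAT"]["col1"].astype(float)
```

The dictionary keeps the original row index of each group, and a missing label raises a KeyError instead of silently leaving a variable undefined.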