I have the following operation, which takes about 1s to run on a pandas DataFrame with 200 columns:
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col_name].dropna().tolist()
              if (_item is not None) and str(_item)]
Is there a faster way to do this? I suspect the tolist() call is a bit slow.
What I'm trying to do here is convert something like:
field         field2
'2014-01-01'  1.0000000
'2015-01-01'  nan
Into something like this:
values_of_field_1 = ['2014-01-01', '2015-01-01']
values_of_field_2 = [1.00000,]
So I can then infer the type of the columns. For example, the end product I'd want would be to get:
type_of_field_1 = DATE # %Y-%m-%d
type_of_field_2 = INTEGER #
It looks like you're trying to cast entire Series columns within a DataFrame to a certain type. Take this DataFrame as an example:
>>> import pandas as pd
>>> import numpy as np
Create a DataFrame with columns with mixed types:
>>> df = pd.DataFrame({'a': [1, np.nan, 2, 'a', None, 'b'], 'b': [1, 2, 3, 4, 5, 6], 'c': [np.nan, np.nan, 2, 2, 'a', 'a']})
>>> df
      a  b    c
0     1  1  NaN
1   NaN  2  NaN
2     2  3    2
3     a  4    2
4  None  5    a
5     b  6    a
>>> df.dtypes
a    object
b     int64
c    object
dtype: object
>>> for col in df.select_dtypes('object'):
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'int'>
nan: <class 'float'>
2: <class 'int'>
a: <class 'str'>
None: <class 'NoneType'>
b: <class 'str'>
c
nan: <class 'float'>
nan: <class 'float'>
2: <class 'int'>
2: <class 'int'>
a: <class 'str'>
a: <class 'str'>
Use pd.Series.astype to cast object dtypes to str:
>>> for col in df.select_dtypes('object'):
...     df[col] = df[col].astype(str)
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
...
a
1: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
a: <class 'str'>
None: <class 'str'>
b: <class 'str'>
c
nan: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
2: <class 'str'>
a: <class 'str'>
a: <class 'str'>
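Note that in the output above, astype(str) turned the missing values into the literal strings 'nan' and 'None'. If you want them left out, as in the question, one option is to call dropna() first. Here is a minimal sketch of that idea; the helper name column_values_as_strings is made up for illustration, and the sample frame is the two-column example from the question:

```python
import numpy as np
import pandas as pd


def column_values_as_strings(series):
    """Return the non-missing values of a Series as a list of strings.

    Hypothetical helper, not part of pandas.
    """
    non_null = series.dropna()  # drops both NaN and None
    if pd.api.types.is_numeric_dtype(non_null):
        # Same fixed-point formatting as '{:f}'.format in the question.
        as_text = non_null.map('{:f}'.format)
    else:
        as_text = non_null.astype(str)
    # Discard empty strings, mirroring the `and str(_item)` filter.
    return as_text[as_text != ''].tolist()


# Sample data taken from the question.
df = pd.DataFrame({
    'field': ['2014-01-01', '2015-01-01'],
    'field2': [1.0, np.nan],
})
values = {col: column_values_as_strings(df[col]) for col in df.columns}
```

Here values['field'] holds both date strings and values['field2'] holds only the formatted 1.0, since the NaN is dropped before the conversion. Whether this beats the list comprehension on 200 columns is something to measure on your own data.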
If you think tolist() is what makes your code slow, you can simply drop it. It is not needed here, because iterating over the Series directly yields the same items. The code below gives the same output:
for col in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col_name].dropna()
              if (_item is not None) and str(_item)]
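Since the end goal in the question is to infer a type per column, it may be possible to skip the intermediate string lists and let pandas attempt the conversions directly. The following is only a sketch under assumptions: infer_column_type is a hypothetical helper, the labels are the ones used in the question plus FLOAT, STRING and UNKNOWN added for illustration, and the date format is the %Y-%m-%d mentioned in the question.

```python
import numpy as np
import pandas as pd


def infer_column_type(series, date_format='%Y-%m-%d'):
    """Guess a coarse type label for a Series (hypothetical helper)."""
    non_null = series.dropna()
    if non_null.empty:
        return 'UNKNOWN'
    # Values that cannot be parsed as numbers become NaN.
    numeric = pd.to_numeric(non_null, errors='coerce')
    if numeric.notna().all():
        # Whole-number floats such as 1.0 are reported as INTEGER.
        return 'INTEGER' if (numeric % 1 == 0).all() else 'FLOAT'
    # Values that do not match date_format become NaT.
    dates = pd.to_datetime(non_null.astype(str), format=date_format,
                           errors='coerce')
    if dates.notna().all():
        return 'DATE'
    return 'STRING'


# Sample data taken from the question.
df = pd.DataFrame({
    'field': ['2014-01-01', '2015-01-01'],
    'field2': [1.0, np.nan],
})
types = {col: infer_column_type(df[col]) for col in df.columns}
```

With the sample frame, types maps 'field' to 'DATE' and 'field2' to 'INTEGER'. The order of the checks is a design choice: numbers are tried first, so a column of strings like '20140101' would be labelled INTEGER rather than DATE.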