Improving on pandas tolist() performance

Question

The following operation takes about 1 s to run on a pandas DataFrame with 200 columns:

for col_name in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col_name].dropna().tolist()
              if (_item is not None) and str(_item)]

Is there a faster way to do this? I suspect the tolist() call is the slow part.

What I'm trying to do here is convert something like:

field         field2
'2014-01-01'  1.0000000
'2015-01-01'  nan

Into something like this:

values_of_field_1 = ['2014-01-01', '2015-01-01']
values_of_field_2 = ['1.000000']

I can then infer the type of each column. The end product I want is:

type_of_field_1 = DATE # %Y-%m-%d
type_of_field_2 = INTEGER #

Answer 1

It looks like you're trying to cast entire columns (Series) of a DataFrame to a certain type. The example below starts with the imports:

>>> import pandas as pd
>>> import numpy as np

Create a DataFrame whose columns hold mixed types:

>>> df = pd.DataFrame({'a': [1, np.nan, 2, 'a', None, 'b'], 'b': [1, 2, 3, 4, 5, 6], 'c': [np.nan, np.nan, 2, 2, 'a', 'a']})
>>> df
      a  b    c
0     1  1  NaN
1   NaN  2  NaN
2     2  3    2
3     a  4    2
4  None  5    a
5     b  6    a
>>> df.dtypes
a    object
b     int64
c    object
dtype: object
>>> for col in df.select_dtypes('object'):
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
... 
a
1: <class 'int'>
nan: <class 'float'>
2: <class 'int'>
a: <class 'str'>
None: <class 'NoneType'>
b: <class 'str'>
c
nan: <class 'float'>
nan: <class 'float'>
2: <class 'int'>
2: <class 'int'>
a: <class 'str'>
a: <class 'str'>

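Use pd.Series.astype to cast the object-dtype columns to str. As the output shows, missing values (NaN and None) are converted too and become the strings 'nan' and 'None':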

>>> for col in df.select_dtypes('object'):
...     df[col] = df[col].astype(str)
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
... 
a
1: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
a: <class 'str'>
None: <class 'str'>
b: <class 'str'>
c
nan: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
2: <class 'str'>
a: <class 'str'>
a: <class 'str'>

Answer 2

If you suspect tolist() is what makes your code slow, you can simply drop it. A Series can be iterated directly, so tolist() is not needed at all. The code below gives the same output:

for col_name in mycols:
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col_name].dropna()
              if (_item is not None) and str(_item)]
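
If the per-item list comprehension turns out to be the bottleneck rather than tolist() itself, the same filtering and formatting can be pushed into pandas calls. This is only a sketch: stringify_column is a hypothetical helper, the numeric check via pd.api.types.is_numeric_dtype stands in for the question's col_raw_type test, and whether it is actually faster on your data is something to measure rather than assume.

```python
import numpy as np
import pandas as pd


def stringify_column(series):
    """Hypothetical helper: non-missing values of one column as strings."""
    non_null = series.dropna()
    if pd.api.types.is_numeric_dtype(non_null):
        # Same fixed-point formatting as '{:f}'.format(_item) in the question.
        return non_null.map('{:f}'.format).tolist()
    as_str = non_null.astype(str)
    # Mirror the original `and str(_item)` filter: skip empty strings.
    return as_str[as_str != ''].tolist()


# The sample data from the question.
df = pd.DataFrame({
    'field': ['2014-01-01', '2015-01-01'],
    'field2': [1.0, np.nan],
})
values = {col_name: stringify_column(df[col_name]) for col_name in df.columns}
print(values)
```

Timing both versions on a realistic DataFrame (for example with timeit) is the only reliable way to tell which one wins for a given mix of column types.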

2 answers

solution1
0 2018-12-24 18:28:37

solution2
0 2018-12-24 18:55:50