
Improving on pandas tolist() performance

I have the following operation which takes about 1s to perform on a pandas dataframe with 200 columns:

for col in mycols:
    col_raw_type = str(df[col].dtype)  # assumed: the column's pandas dtype name
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col].dropna().tolist()
              if (_item is not None) and str(_item)]

Is there a faster way to do this? I suspect the tolist() call is a bit slow.

What I'm trying to do here is convert something like:

field         field2
'2014-01-01'  1.0000000
'2015-01-01'  nan

Into something like this:

values_of_field_1 = ['2014-01-01', '2015-01-01']
values_of_field_2 = [1.00000,]

So I can then infer the type of the columns. For example, the end product I'd want would be to get:

type_of_field_1 = DATE # %Y-%m-%d
type_of_field_2 = INTEGER #

It looks like you're trying to cast entire Series columns within a DataFrame to a certain type. Taking this DataFrame as an example:

>>> import pandas as pd
>>> import numpy as np

Create a DataFrame with columns with mixed types:

>>> df = pd.DataFrame({'a': [1, np.nan, 2, 'a', None, 'b'], 'b': [1, 2, 3, 4, 5, 6], 'c': [np.nan, np.nan, 2, 2, 'a', 'a']})
>>> df
      a  b    c
0     1  1  NaN
1   NaN  2  NaN
2     2  3    2
3     a  4    2
4  None  5    a
5     b  6    a
>>> df.dtypes
a    object
b     int64
c    object
dtype: object
>>> for col in df.select_dtypes('object'):
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
... 
a
1: <class 'int'>
nan: <class 'float'>
2: <class 'int'>
a: <class 'str'>
None: <class 'NoneType'>
b: <class 'str'>
c
nan: <class 'float'>
nan: <class 'float'>
2: <class 'int'>
2: <class 'int'>
a: <class 'str'>
a: <class 'str'>

Use pd.Series.astype to cast object-dtype columns to str (note that missing values become the literal strings 'nan' and 'None'):

>>> for col in df.select_dtypes('object'):
...     df[col] = df[col].astype(str)
...     print(col)
...     print('\n'.join('{}: {}'.format(v, type(v)) for v in df[col]))
... 
a
1: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
a: <class 'str'>
None: <class 'str'>
b: <class 'str'>
c
nan: <class 'str'>
nan: <class 'str'>
2: <class 'str'>
2: <class 'str'>
a: <class 'str'>
a: <class 'str'>
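
Since astype(str) turns missing values into the strings 'nan' and 'None', one option is to drop them first and keep the whole conversion inside pandas. The following is only a minimal sketch of that idea, not a benchmarked replacement: the helper name column_values is hypothetical, and it assumes every column is either numeric or holds string-like objects.

```python
import numpy as np
import pandas as pd


def column_values(series):
    """Return the non-null values of a column as a list of strings.

    Hypothetical helper: numeric columns are formatted with '{:f}',
    everything else is cast to str, mirroring the loop in the question.
    """
    non_null = series.dropna()  # removes both NaN and None
    if pd.api.types.is_numeric_dtype(non_null):
        # Fixed-point formatting, same as '{:f}'.format(_item)
        return non_null.map('{:f}'.format).tolist()
    strings = non_null.astype(str)
    # Mirror the original `and str(_item)` filter: skip empty strings
    return strings[strings != ''].tolist()


# Sample data modelled on the table in the question
df = pd.DataFrame({
    'field': ['2014-01-01', '2015-01-01'],
    'field2': [1.0, np.nan],
})
values = {col: column_values(df[col]) for col in df.columns}
print(values)
```

Here tolist() is called once per column on data that has already been filtered and converted, rather than feeding a Python-level list comprehension.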

If you suspect tolist() is slowing your code down, you can simply drop it. Iterating over the Series returned by dropna() yields the same values, so tolist() is not needed at all. The code below gives the same output.

for col in mycols:
    col_raw_type = str(df[col].dtype)  # assumed: the column's pandas dtype name
    values = [str(_item) if col_raw_type == 'object' else '{:f}'.format(_item)
              for _item in df[col].dropna()
              if (_item is not None) and str(_item)]
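
If the end goal is only to infer a type per column, as the question describes, it may be possible to skip the intermediate string lists and let pandas attempt the parsing directly. This is a rough sketch under several assumptions: the function name infer_column_type and the labels 'FLOAT', 'STRING' and 'UNKNOWN' are hypothetical (only DATE and INTEGER appear in the question), columns are assumed to hold numbers or strings, and only the single date format %Y-%m-%d is tried.

```python
import numpy as np
import pandas as pd


def infer_column_type(series, date_format='%Y-%m-%d'):
    """Guess a coarse type label for a column (labels are illustrative)."""
    non_null = series.dropna()
    if non_null.empty:
        return 'UNKNOWN'
    # Values that cannot be parsed as numbers become NaN
    numeric = pd.to_numeric(non_null, errors='coerce')
    if numeric.notna().all():
        # INTEGER when no value has a fractional part, e.g. 1.0
        return 'INTEGER' if (numeric % 1 == 0).all() else 'FLOAT'
    # Values that do not match date_format become NaT
    dates = pd.to_datetime(non_null.astype(str), format=date_format,
                           errors='coerce')
    if dates.notna().all():
        return 'DATE'
    return 'STRING'


# Sample data modelled on the table in the question
df = pd.DataFrame({
    'field': ['2014-01-01', '2015-01-01'],
    'field2': [1.0, np.nan],
})
inferred = {col: infer_column_type(df[col]) for col in df.columns}
print(inferred)
```

Both pd.to_numeric and pd.to_datetime work on the whole column at once, so no per-item Python loop is involved. Whether this is actually faster on a 200-column frame would need to be measured.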

