
Pandas: type error when creating a Pivot Table

I have a minimally reproducible dataset (which is retrieved from an internal database) that looks as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'manufacturer':['BMW', 'Mercedes', 'Mercedes', 'Audi'],
                  'created_time':['2021-03-05T07:18:12.281-0600', '2021-03-04T15:34:23.373-0600', '2021-03-01T04:57:47.848-0600', '2021-02-25T09:31:37.341-0600'],
                  'action_time':['2021-03-05T08:32:19.153-0600', '2021-03-04T15:37:32.360-0600', '2021-03-01T08:37:39.083-0600', '2021-02-25T09:58:12.425-0600']})

df
    manufacturer    created_time                    action_time
0   BMW             2021-03-05T07:18:12.281-0600    2021-03-05T08:32:19.153-0600
1   Mercedes        2021-03-04T15:34:23.373-0600    2021-03-04T15:37:32.360-0600
2   Mercedes        2021-03-01T04:57:47.848-0600    2021-03-01T08:37:39.083-0600
3   Audi            2021-02-25T09:31:37.341-0600    2021-02-25T09:58:12.425-0600

I then create a custom column named elapsed_time:

df['created_time'] = pd.to_datetime(df['created_time'])
df['action_time'] = pd.to_datetime(df['action_time'])
time_threshold = pd.to_datetime('08:30').time()

df['created_time_adjusted']=df['created_time'].apply(lambda x:
                                                  x.replace(hour=8,minute=30,second=0)
                                                  if x.time()<time_threshold else x)

df['elapsed_time'] = (df['action_time'] - df['created_time_adjusted']).dt.total_seconds() /60

The updated dataframe looks as follows:

column_headers = ['manufacturer', 'created_time', 'action_time', 'created_time_adjusted', 'elapsed_time']
df = df.reindex(columns=column_headers)
df

    manufacturer   created_time                        action_time                        created_time_adjusted             elapsed_time
0   BMW            2021-03-05 07:18:12.281000-06:00    2021-03-05 08:32:19.153000-06:00   2021-03-05 08:30:00.281000-06:00  2.314533
1   Mercedes       2021-03-04 15:34:23.373000-06:00    2021-03-04 15:37:32.360000-06:00   2021-03-04 15:34:23.373000-06:00  3.149783
2   Mercedes       2021-03-01 04:57:47.848000-06:00    2021-03-01 08:37:39.083000-06:00   2021-03-01 08:30:00.848000-06:00  7.637250
3   Audi           2021-02-25 09:31:37.341000-06:00    2021-02-25 09:58:12.425000-06:00   2021-02-25 09:31:37.341000-06:00  26.584733

So far, so good.

The types look as follows:

df.dtypes
manufacturer                                             object
created_time             datetime64[ns, pytz.FixedOffset(-360)]
action_time              datetime64[ns, pytz.FixedOffset(-360)]
created_time_adjusted    datetime64[ns, pytz.FixedOffset(-360)]
elapsed_time                                            float64
dtype: object

Finally, I try to pivot the data to see the mean elapsed_time by manufacturer. I do so as follows:

pivoted_data = pd.pivot_table(data=df, index='manufacturer', values='elapsed_time', aggfunc=np.mean)

Which, on this toy dataset, gives:

pivoted_data

              elapsed_time
manufacturer
Audi             26.584733
BMW               2.314533
Mercedes          5.393517

However, when I run this on the production dataset (which, to reiterate, has the same datatypes), I see the following error:

TypeError:  '<' not supported between instances of 'CustomFieldOption' and 'CustomFieldOption'

The error indicates that it's a problem with types. But, I don't see how that can be when the types match between the toy dataset and the production dataset.

Does anyone know what's wrong here or how I can debug this further?

Thanks!

####################################################################

UPDATE:

After following the suggestion of @Icarwiz, I dug a little deeper and called the following on manufacturer:

df['manufacturer'].unique()

This resulted in:

array([<DB CustomFieldOption:  value='BMW', id='32563'>,
       <DB CustomFieldOption:  value='Mercedes', id='32431'>,
       <DB CustomFieldOption:  value='Mercedes', id='32431'>,
       <DB CustomFieldOption:  value='Audi', id='28371'>],
      dtype=object)

So, this is a complex data type. Any idea where to go from here?

You say "the same datatype", but since the dtype of manufacturer is object, that doesn't prove anything: an object column can hold anything from plain str values to very complex objects.

And here it seems you have things in your manufacturer column that are not strings...

Try this to find them:

df[df['manufacturer'].map(lambda x: type(x) is not str)]
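
A minimal sketch of why matching dtypes can still hide the problem. The CustomFieldOption class below is a hypothetical stand-in for the real database object (assumed to carry value and id attributes and to define no ordering); the actual class may differ. Sorting such objects, which pivot_table presumably does when it orders the group keys, fails with the same message as in the question.

```python
import pandas as pd


class CustomFieldOption:
    """Hypothetical stand-in for the database object; the real class may differ."""

    def __init__(self, value, id):
        self.value = value
        self.id = id

    def __repr__(self):
        return f"<DB CustomFieldOption:  value={self.value!r}, id={self.id!r}>"


manufacturers = pd.Series([
    CustomFieldOption('BMW', '32563'),
    CustomFieldOption('Mercedes', '32431'),
    CustomFieldOption('Audi', '28371'),
])

# The dtype only says "Python objects"; it does not say which kind.
column_dtype = str(manufacturers.dtype)

# Looking at the element types reveals what is actually stored.
element_types = {type(x).__name__ for x in manufacturers}

# Ordering the values needs '<', which this stand-in class does not define.
try:
    sorted(manufacturers)
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)

print(column_dtype)
print(element_types)
print(sort_error)
```

So df.dtypes reports object for both the toy data and the production data, even though only the toy data holds plain strings.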

EDIT:

Ok, so that's an answer... :)

Now you just have to deal with this kind of object, which I'm not familiar with. As an educated guess, try converting the column to str with something like this:

df['manufacturer'] = df['manufacturer'].map(lambda x: x.value)

If that doesn't work, you'll have to inspect the object to learn how it works, for example:

test_obj = df['manufacturer'].iloc[0]
print(dir(test_obj))
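
Putting the suggestions above together, here is a sketch of the conversion followed by the pivot. It again uses a hypothetical CustomFieldOption stand-in with an assumed value attribute, and the elapsed_time values are copied from the toy data in the question. The getattr fallback is an added precaution, not something the real object is known to need: values that are already plain strings pass through unchanged.

```python
import pandas as pd


class CustomFieldOption:
    """Hypothetical stand-in; assumes the real object exposes `.value`."""

    def __init__(self, value, id):
        self.value = value
        self.id = id


df = pd.DataFrame({
    'manufacturer': [
        CustomFieldOption('BMW', '32563'),
        CustomFieldOption('Mercedes', '32431'),
        CustomFieldOption('Mercedes', '32431'),
        CustomFieldOption('Audi', '28371'),
    ],
    'elapsed_time': [2.314533, 3.149783, 7.637250, 26.584733],
})

# Replace each object with its `value` attribute as a plain string;
# anything lacking that attribute is converted with str() instead.
df['manufacturer'] = df['manufacturer'].map(
    lambda x: str(getattr(x, 'value', x))
)

# With plain strings as keys, the group labels can be sorted normally.
pivoted_data = pd.pivot_table(data=df, index='manufacturer',
                              values='elapsed_time', aggfunc='mean')
print(pivoted_data)
```

Once the keys are strings, the two Mercedes rows also fall into a single group, which the raw objects may not do if they do not compare equal to each other.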
