
Pandas: type error when creating a Pivot Table

I have a minimal reproducible dataset (retrieved from an internal database) that looks as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'manufacturer':['BMW', 'Mercedes', 'Mercedes', 'Audi'],
                  'created_time':['2021-03-05T07:18:12.281-0600', '2021-03-04T15:34:23.373-0600', '2021-03-01T04:57:47.848-0600', '2021-02-25T09:31:37.341-0600'],
                  'action_time':['2021-03-05T08:32:19.153-0600', '2021-03-04T15:37:32.360-0600', '2021-03-01T08:37:39.083-0600', '2021-02-25T09:58:12.425-0600']})

df
    manufacturer    created_time                    action_time
0   BMW             2021-03-05T07:18:12.281-0600    2021-03-05T08:32:19.153-0600
1   Mercedes        2021-03-04T15:34:23.373-0600    2021-03-04T15:37:32.360-0600
2   Mercedes        2021-03-01T04:57:47.848-0600    2021-03-01T08:37:39.083-0600
3   Audi            2021-02-25T09:31:37.341-0600    2021-02-25T09:58:12.425-0600

I then create a custom column named elapsed_time:

df['created_time'] = pd.to_datetime(df['created_time'])
df['action_time'] = pd.to_datetime(df['action_time'])
time_threshold = pd.to_datetime('08:30').time()

df['created_time_adjusted']=df['created_time'].apply(lambda x:
                                                  x.replace(hour=8,minute=30,second=0)
                                                  if x.time()<time_threshold else x)

df['elapsed_time'] = (df['action_time'] - df['created_time_adjusted']).dt.total_seconds() /60

The updated dataframe looks as follows:

column_headers = ['manufacturer', 'created_time', 'action_time', 'created_time_adjusted', 'elapsed_time']
df = df.reindex(columns=column_headers)
df

    manufacturer   created_time                        action_time                        created_time_adjusted             elapsed_time
0   BMW            2021-03-05 07:18:12.281000-06:00    2021-03-05 08:32:19.153000-06:00   2021-03-05 08:30:00.281000-06:00  2.314533
1   Mercedes       2021-03-04 15:34:23.373000-06:00    2021-03-04 15:37:32.360000-06:00   2021-03-04 15:34:23.373000-06:00  3.149783
2   Mercedes       2021-03-01 04:57:47.848000-06:00    2021-03-01 08:37:39.083000-06:00   2021-03-01 08:30:00.848000-06:00  7.637250
3   Audi           2021-02-25 09:31:37.341000-06:00    2021-02-25 09:58:12.425000-06:00   2021-02-25 09:31:37.341000-06:00  26.584733

So far, so good.

The types look as follows:

df.dtypes
manufacturer                                             object
created_time             datetime64[ns, pytz.FixedOffset(-360)]
action_time              datetime64[ns, pytz.FixedOffset(-360)]
created_time_adjusted    datetime64[ns, pytz.FixedOffset(-360)]
elapsed_time                                            float64
dtype: object

Finally, I try to pivot the data to see the mean elapsed_time by manufacturer. I do so as follows:

pivoted_data = pd.pivot_table(data=df, index='manufacturer', values='elapsed_time', aggfunc=np.mean)

Which, on this toy dataset, gives:

pivoted_data

    elapsed_time
manufacturer    
Audi        26.584733
BMW         2.314533
Mercedes    5.393517

However, when I run this on the production dataset (which, to reiterate, has the same datatypes), I see the following error:

TypeError:  '<' not supported between instances of 'CustomFieldOption' and 'CustomFieldOption'

The error indicates a problem with types. But I don't see how that can be, given that the types match between the toy dataset and the production dataset.

Does anyone know what's wrong here or how I can debug this further?

Thanks!

####################################################################

UPDATE:

Following the suggestion of @Icarwiz, I dug a little deeper and called the following on manufacturer:

df['manufacturer'].unique()

This resulted in:

array([<DB CustomFieldOption:  value='BMW', id='32563'>,
       <DB CustomFieldOption:  value='Mercedes', id='32431'>,
       <DB CustomFieldOption:  value='Mercedes', id='32431'>,
       <DB CustomFieldOption:  value='Audi', id='28371'>],
      dtype=object)

So, this is a complex data type. Any idea where to go from here?

You say "the same datatypes", but for manufacturer the dtype is object, which proves nothing: an object column can hold anything from simple str values to very complex objects.

And here it seems your manufacturer column contains things that are not strings...

Try this to find them:

df[df['manufacturer'].map(lambda x: type(x) is not str)]
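A variant of that check, as a minimal sketch: it counts the Python type of each element instead of filtering rows, and it shows why the error mentions '<'. The CustomFieldOption class and df_demo below are hypothetical stand-ins, not the real database class or the production data. My understanding is that pivot_table groups by the index column and sorts the group keys by default, which is presumably where the comparison fails; plain sorted() reproduces the same kind of message.

```python
import pandas as pd


class CustomFieldOption:
    """Hypothetical stand-in for the database object (not the real class)."""

    def __init__(self, value, id):
        self.value = value
        self.id = id

    def __repr__(self):
        return f"<DB CustomFieldOption:  value={self.value!r}, id={self.id!r}>"


# Toy frame mixing a custom object and a plain string in one column.
df_demo = pd.DataFrame({
    'manufacturer': [CustomFieldOption('BMW', '32563'), 'Audi'],
    'elapsed_time': [2.3, 26.6],
})

# The column dtype is reported as "object" whatever the elements really are.
print(df_demo['manufacturer'].dtype)

# Count the Python type of every element to expose non-string values.
type_counts = df_demo['manufacturer'].map(lambda x: type(x).__name__).value_counts()
print(type_counts)

# Objects without ordering methods cannot be sorted; the TypeError raised here
# has the same wording as the one reported in the question.
try:
    sorted([CustomFieldOption('BMW', '32563'), CustomFieldOption('Audi', '28371')])
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)
print(sort_error)
```

If type_counts shows anything other than str, those rows are the ones to convert before pivoting.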

EDIT:

Ok, so that's an answer... :)

Now you just have to deal with this kind of object, which I'm not familiar with. As an educated guess, try converting the column to str with something like this:

df['manufacturer'] = df['manufacturer'].map(lambda x: x.value)

But if that doesn't work, you'll have to find out how this kind of object works, for example by inspecting its attributes:

test_obj = df['manufacturer'].iloc[0]
print(dir(test_obj))
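Putting the pieces together, here is a rough sketch of the conversion followed by the pivot. It assumes the objects expose the label through a .value attribute, which the repr in the question suggests but does not guarantee. CustomFieldOption, raw and to_label are hypothetical names used for illustration only, and the elapsed_time numbers are copied from the toy dataset.

```python
import pandas as pd


class CustomFieldOption:
    """Hypothetical stand-in for the database object (not the real class)."""

    def __init__(self, value, id):
        self.value = value
        self.id = id


# Toy data shaped like the production case: objects instead of strings.
raw = pd.DataFrame({
    'manufacturer': [CustomFieldOption('BMW', '32563'),
                     CustomFieldOption('Mercedes', '32431'),
                     CustomFieldOption('Mercedes', '32431'),
                     CustomFieldOption('Audi', '28371')],
    'elapsed_time': [2.314533, 3.149783, 7.637250, 26.584733],
})


def to_label(x):
    # Use .value when the element has it (assumed attribute name);
    # otherwise fall back to the element's own string form.
    return str(getattr(x, 'value', x))


# Replace the objects with plain strings so the keys can be grouped and sorted.
raw['manufacturer'] = raw['manufacturer'].map(to_label)

pivoted = pd.pivot_table(data=raw, index='manufacturer',
                         values='elapsed_time', aggfunc='mean')
print(pivoted)
```

The getattr fallback means rows that are already plain strings pass through unchanged, so the conversion can be applied to a column with mixed contents.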
