Pandas: type error when creating a Pivot Table
I have a minimal reproducible dataset (retrieved from an internal database) that looks as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'manufacturer':['BMW', 'Mercedes', 'Mercedes', 'Audi'],
'created_time':['2021-03-05T07:18:12.281-0600', '2021-03-04T15:34:23.373-0600', '2021-03-01T04:57:47.848-0600', '2021-02-25T09:31:37.341-0600'],
'action_time':['2021-03-05T08:32:19.153-0600', '2021-03-04T15:37:32.360-0600', '2021-03-01T08:37:39.083-0600', '2021-02-25T09:58:12.425-0600']})
df
manufacturer created_time action_time
0 BMW 2021-03-05T07:18:12.281-0600 2021-03-05T08:32:19.153-0600
1 Mercedes 2021-03-04T15:34:23.373-0600 2021-03-04T15:37:32.360-0600
2 Mercedes 2021-03-01T04:57:47.848-0600 2021-03-01T08:37:39.083-0600
3 Audi 2021-02-25T09:31:37.341-0600 2021-02-25T09:58:12.425-0600
I then create a custom column named elapsed_time:
df['created_time'] = pd.to_datetime(df['created_time'])
df['action_time'] = pd.to_datetime(df['action_time'])
time_threshold = pd.to_datetime('08:30').time()
df['created_time_adjusted']=df['created_time'].apply(lambda x:
x.replace(hour=8,minute=30,second=0)
if x.time()<time_threshold else x)
df['elapsed_time'] = (df['action_time'] - df['created_time_adjusted']).dt.total_seconds() /60
The updated dataframe looks as follows:
column_headers = ['manufacturer', 'created_time', 'action_time', 'created_time_adjusted', 'elapsed_time']
df = df.reindex(columns=column_headers)
df
manufacturer created_time action_time created_time_adjusted elapsed_time
0 BMW 2021-03-05 07:18:12.281000-06:00 2021-03-05 08:32:19.153000-06:00 2021-03-05 08:30:00.281000-06:00 2.314533
1 Mercedes 2021-03-04 15:34:23.373000-06:00 2021-03-04 15:37:32.360000-06:00 2021-03-04 15:34:23.373000-06:00 3.149783
2 Mercedes 2021-03-01 04:57:47.848000-06:00 2021-03-01 08:37:39.083000-06:00 2021-03-01 08:30:00.848000-06:00 7.637250
3 Audi 2021-02-25 09:31:37.341000-06:00 2021-02-25 09:58:12.425000-06:00 2021-02-25 09:31:37.341000-06:00 26.584733
So far, so good.
The types look as follows:
df.dtypes
manufacturer object
created_time datetime64[ns, pytz.FixedOffset(-360)]
action_time datetime64[ns, pytz.FixedOffset(-360)]
created_time_adjusted datetime64[ns, pytz.FixedOffset(-360)]
elapsed_time float64
dtype: object
Finally, I try to pivot the data to see the mean elapsed_time by manufacturer. I do so as follows:
pivoted_data = pd.pivot_table(data=df, index='manufacturer', values='elapsed_time', aggfunc=np.mean)
On this toy dataset, this gives:
pivoted_data
elapsed_time
manufacturer
Audi 26.584733
BMW 2.314533
Mercedes 5.393517
However, when I run this on the production dataset (which, to reiterate, has the same datatypes), I see the following error:
TypeError: '<' not supported between instances of 'CustomFieldOption' and 'CustomFieldOption'
The error indicates that it's a problem with types. But I don't see how that can be, when the types match between the toy dataset and the production dataset.
Does anyone know what's wrong here or how I can debug this further?
Thanks!
################################################################################
UPDATE:
After following the suggestion of @Icarwiz, I dug a little deeper and called the following on manufacturer:
df['manufacturer'].unique()
This resulted in:
array([<DB CustomFieldOption: value='BMW', id='32563'>,
<DB CustomFieldOption: value='Mercedes', id='32431'>,
<DB CustomFieldOption: value='Mercedes', id='32431'>,
<DB CustomFieldOption: value='Audi', id='28371'>],
dtype=object)
So, this is a complex data type. Any idea where to go from here?
You say "the same datatypes", but since the dtype of manufacturer is object, that doesn't prove anything: an object column can hold anything from simple str values to very complex objects.
And here it seems your manufacturer column contains things that are not strings...
Try this to find them:
df[df['manufacturer'].map(lambda x: type(x) is not str)]
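A minimal sketch of why the dtype check is not enough. FakeOption is a hypothetical stand-in for the database's CustomFieldOption class (the real class is not shown in the question), and dtype=object is passed explicitly because newer pandas versions may infer a dedicated string dtype for plain strings:

```python
import pandas as pd

class FakeOption:
    """Hypothetical stand-in for CustomFieldOption (assumption, not the real class)."""
    def __init__(self, value, id):
        self.value = value
        self.id = id

strings = pd.Series(['BMW', 'Audi'], dtype=object)
objects = pd.Series([FakeOption('BMW', '32563'), FakeOption('Audi', '28371')])

# Both columns report the same dtype ...
print(strings.dtype, objects.dtype)

# ... but counting the element types shows what is really stored.
type_counts = objects.map(lambda x: type(x).__name__).value_counts()
print(type_counts)
```

Counting type names per column is a quick way to spot mixed or unexpected element types before grouping or pivoting.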
EDIT:
OK, so that's an answer... :)
Now you just have to deal with this kind of object, which I don't know. As an educated guess, try converting your column to str with something like this:
df['manufacturer'] = df['manufacturer'].map(lambda x: x.value)
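Here is a sketch of that conversion end to end, again using a hypothetical FakeOption class in place of CustomFieldOption and assuming the real objects expose a value attribute, as their repr suggests. Plain sorted() is used to show the comparison failure; presumably the pivot fails the same way because it sorts the group keys by default:

```python
import pandas as pd

class FakeOption:
    """Hypothetical stand-in for CustomFieldOption: has .value, defines no ordering."""
    def __init__(self, value, id):
        self.value = value
        self.id = id

df = pd.DataFrame({
    'manufacturer': [FakeOption('BMW', '32563'), FakeOption('Mercedes', '32431'),
                     FakeOption('Mercedes', '32431'), FakeOption('Audi', '28371')],
    'elapsed_time': [2.314533, 3.149783, 7.637250, 26.584733],
})

# Objects that do not define __lt__ cannot be ordered; this is what the error reports.
try:
    sorted(df['manufacturer'])
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)
print(sort_error)

# Replace each object with its plain string value, then pivot as before.
df['manufacturer'] = df['manufacturer'].map(lambda x: x.value)
pivoted = pd.pivot_table(data=df, index='manufacturer',
                         values='elapsed_time', aggfunc='mean')
print(pivoted)
```

Once the column holds plain strings, the two Mercedes rows also fall into one group, since equal strings compare and hash as equal.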
But if it doesn't work you'll have to learn how this kind of object works...
test_obj = df['manufacturer'].iloc[0]
print(dir(test_obj))
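The output of dir() is long, so it may help to filter out the underscore-prefixed names. A small sketch on a hypothetical FakeOption stand-in (the real CustomFieldOption may expose different attributes, and vars() only works if the object has a __dict__):

```python
class FakeOption:
    """Hypothetical stand-in for CustomFieldOption (assumption, not the real class)."""
    def __init__(self, value, id):
        self.value = value
        self.id = id

test_obj = FakeOption('BMW', '32563')

# Keep only the public names; dir() returns them in alphabetical order.
public_attrs = [name for name in dir(test_obj) if not name.startswith('_')]
print(public_attrs)

# For objects that store their state in __dict__, vars() shows names and values.
instance_state = vars(test_obj)
print(instance_state)
```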