Strange output from nunique after groupby
I am intrigued by the behavior of the following code:
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'B'],
                   'Date': ['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-02', '2020-01-03']})
# ISO-formatted strings parse without infer_datetime_format, which is deprecated in pandas 2.0
df['Date'] = pd.to_datetime(df['Date'])
df['Data_Points'] = df.groupby(['Name'])['Date'].transform('nunique')
print(df)
Which outputs:
  Name       Date                   Data_Points
0    A 2020-01-01 1970-01-01 00:00:00.000000002
1    A 2020-01-02 1970-01-01 00:00:00.000000002
2    B 2020-01-01 1970-01-01 00:00:00.000000003
3    B 2020-01-02 1970-01-01 00:00:00.000000003
4    B 2020-01-03 1970-01-01 00:00:00.000000003
My question is why I am getting a datetime value after using transform('nunique') when the documentation for pandas.Series.nunique clearly states:
Return number of unique elements in the object.
Returns: int
And pandas.DataFrame.transform does not mention anything about retaining the dtype of the aggregated column, only:
Call func on self producing a DataFrame with transformed values. Produced DataFrame will have same axis length as self.
So, when combining both functions, why am I getting a datetime instead of the int that nunique() promises? Does the dtype of the aggregated column take precedence over the function passed to transform() when the dtype of the transformed column is determined? Is this the expected behavior?
I think it is a bug. A possible solution is to pass the function itself instead of its name:
df['Data_Points'] = df.groupby(['Name'])['Date'].transform(pd.Series.nunique)
print(df)
  Name       Date  Data_Points
0    A 2020-01-01            2
1    A 2020-01-02            2
2    B 2020-01-01            3
3    B 2020-01-02            3
4    B 2020-01-03            3
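Another route, offered only as a sketch, sidesteps transform entirely: compute one count per group with a plain groupby(...).nunique() and map it back onto the Name column. The intermediate name counts is just an illustrative variable, not something from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02',
                            '2020-01-01', '2020-01-02', '2020-01-03']),
})

# One integer per group: the number of distinct dates for each Name.
counts = df.groupby('Name')['Date'].nunique()

# Broadcast the per-group count back to every row through the Name column.
df['Data_Points'] = df['Name'].map(counts)
print(df)
```

Because the counts never pass through the datetime column's dtype, the result stays an integer column.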
IIUC, this happens because the transformed result is inserted back with the original (datetime) dtype. Adding astype(int) solves it:
df.groupby('Name')["Date"].transform("nunique").astype(int)
Output:
0    2
1    2
2    3
3    3
4    3
Name: Date, dtype: int64
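To see where the 1970-01-01 timestamps come from, the small sketch below reinterprets plain integer counts as datetime64[ns] values. It only illustrates how an integer reads as nanoseconds since the epoch; it is not a trace of what pandas does internally, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Group counts as plain 64-bit integers, like the values nunique returns.
counts = np.array([2, 3], dtype='int64')

# Reinterpreting the same bytes as datetime64[ns] treats each integer as
# nanoseconds since 1970-01-01, which reproduces the odd timestamps.
as_dates = pd.Series(counts.view('datetime64[ns]'))
print(as_dates)

# Converting back to int64 recovers the original counts.
recovered = as_dates.astype('int64')
print(recovered.tolist())
```

This suggests the counts themselves are correct and only their dtype is wrong, which would explain why casting back to an integer type is enough.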