Strange output from nunique after groupby

I am intrigued by the behavior of the following code:

import pandas as pd
df = pd.DataFrame({'Name':['A','A','B','B','B'],
                   'Date':['2020-01-01','2020-01-02','2020-01-01','2020-01-02','2020-01-03']})
df['Date'] = pd.to_datetime(df['Date'],infer_datetime_format=True)
df['Data_Points'] = df.groupby(['Name'])['Date'].transform('nunique')
print(df)

Which outputs:

  Name       Date                   Data_Points
0    A 2020-01-01 1970-01-01 00:00:00.000000002
1    A 2020-01-02 1970-01-01 00:00:00.000000002
2    B 2020-01-01 1970-01-01 00:00:00.000000003
3    B 2020-01-02 1970-01-01 00:00:00.000000003
4    B 2020-01-03 1970-01-01 00:00:00.000000003

The question is why I am getting a datetime value after using transform('nunique'), when the documentation for pandas.Series.nunique clearly states:

Return number of unique elements in the object.

Returns: int

And the documentation for pandas.DataFrame.transform does not mention anything about retaining the dtype of the aggregated column. It only says:

Call func on self producing a DataFrame with transformed values. Produced DataFrame will have same axis length as self.

So, when combining the two, why am I getting a datetime instead of the int that nunique() promises? Does the dtype of the aggregated column take precedence over the function passed to transform() when the dtype of the transformed column is determined? Is this the expected behavior?

I think it is a bug. A possible solution is to pass the function itself instead of its name:

df['Data_Points'] = df.groupby(['Name'])['Date'].transform(pd.Series.nunique)
print(df)
  Name       Date  Data_Points
0    A 2020-01-01            2
1    A 2020-01-02            2
2    B 2020-01-01            3
3    B 2020-01-02            3
4    B 2020-01-03            3
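
Another way to sidestep the dtype question, not taken from the answer above, is to avoid transform altogether: aggregate once per group and broadcast the counts back with map. The sketch below rebuilds the question's frame and assumes that the aggregation groupby(...).nunique() returns plain integer counts.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02',
                            '2020-01-01', '2020-01-02', '2020-01-03']),
})

# One integer count per group, indexed by Name.
counts = df.groupby('Name')['Date'].nunique()

# Broadcast each group's count back onto its rows.
df['Data_Points'] = df['Name'].map(counts)

print(df)
print(df.dtypes)
```

Because the counts never pass through the datetime column's transform path, there is nothing to cast them back to the original dtype.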

IIUC, it is because the transformed result is inserted back with the original (datetime) dtype. Adding astype(int) solves it:

df.groupby('Name')["Date"].transform("nunique").astype(int)

Output:

0    2
1    2
2    3
3    3
4    3
Name: Date, dtype: int64
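
The odd timestamps in the question are consistent with this explanation: 1970-01-01 00:00:00.000000002 is exactly 2 nanoseconds after the Unix epoch, which is what the integer 2 looks like when read as a datetime64[ns] value. The following is a small sketch of that reinterpretation. It illustrates the encoding only and is not a trace of what pandas does internally.

```python
import pandas as pd

# An integer passed to pd.Timestamp is read as nanoseconds since 1970-01-01.
ts = pd.Timestamp(2)
print(ts)        # 1970-01-01 00:00:00.000000002
print(ts.value)  # 2

# The same reinterpretation applied to the per-row counts from the question.
counts = pd.Series([2, 2, 3, 3, 3])
as_dates = pd.to_datetime(counts, unit='ns')
print(as_dates)
```

Seen this way, astype(int) simply reads the underlying nanosecond integers back out, which is why it recovers 2 and 3.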
