
Pandas max date by row?

The solution to the question asked here unfortunately does not solve this problem. I'm using Python 3.6.2.

The DataFrame, df:

                            date1                        date2
rec0    2017-05-25 14:02:23+00:00    2017-05-25 14:34:43+00:00
rec1                          NaT    2017-05-16 19:37:43+00:00

To reproduce the problem:

import psycopg2
import pandas as pd
Timestamp = pd.Timestamp
NaT = pd.NaT

df = pd.DataFrame({'date1': [Timestamp('2017-05-25 14:02:23'), NaT],
                   'date2': [Timestamp('2017-05-25 14:34:43'), Timestamp('2017-05-16 19:37:43')]})

tz = psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)
for col in ['date1', 'date2']:
    df[col] = pd.DatetimeIndex(df[col]).tz_localize(tz)
print(df.max(axis=1))

Both of the above columns have been converted using pd.to_datetime() to get the following column type: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)]

Running df.max(axis=1) does not raise an error, but it returns an incorrect result.

Output (incorrect):

rec0   NaN
rec1   NaN
dtype: float64

The fix that I have in place is to apply a custom function to the df, as written below:

def get_max(x):
    test = x.dropna()
    return max(test)
df.apply(get_max,axis=1)

Output (correct):

rec0   2017-05-25 14:34:43+00:00
rec1   2017-05-16 19:37:43+00:00
dtype: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)]

Maybe df.max() doesn't deal with date objects and only looks for floats (docs). Any idea why df.max(axis=1) only returns NaN?

After some testing, it looks like something goes wrong when pandas is combined with psycopg2.tz.FixedOffsetTimezone.

If you try df.max(axis=0), it works as expected, but as you indicate, df.max(axis=1) returns a series of NaN. If you do not use psycopg2.tz.FixedOffsetTimezone as tz, df.max(axis=1) returns the expected result.

Other manipulations also fail in this case, such as df.transpose.

Note that if you try df.values.max(axis=1), you will get the expected result, so numpy.array seems to be able to deal with this. You should search the pandas GitHub issues (like this one) and maybe consider opening a new one if you can't find a fix.

Another solution would be to drop psycopg2.tz.FixedOffsetTimezone, but you may have some reason to use it specifically.
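As a rough illustration of that last option, here is a minimal sketch that rebuilds the sample data from the question and localizes it to a plain "UTC" zone instead of the psycopg2 one. Treating "UTC" as an acceptable stand-in for the zero-offset psycopg2 timezone is an assumption, and the exact behaviour may vary between pandas versions.

```python
import pandas as pd

# Same sample data as in the question, starting from naive timestamps.
df = pd.DataFrame({
    "date1": [pd.Timestamp("2017-05-25 14:02:23"), pd.NaT],
    "date2": [pd.Timestamp("2017-05-25 14:34:43"), pd.Timestamp("2017-05-16 19:37:43")],
})

# Assumption: a plain "UTC" zone is an acceptable replacement for
# psycopg2.tz.FixedOffsetTimezone(offset=0, name=None).
for col in ["date1", "date2"]:
    df[col] = df[col].dt.tz_localize("UTC")

# Row-wise maximum; NaT is skipped by default (skipna=True).
row_max = df.max(axis=1)
print(row_max)
```

For columns that are already timezone-aware, `.dt.tz_convert("UTC")` would be the analogous call in place of `tz_localize`.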

Using pandas 1.0.5 with Python 3.8, I was still getting a series of NaN values. I solved the issue by converting both columns to datetime and then passing skipna=True and numeric_only=False to max():

df['1'] = pd.to_datetime(df['1'], utc=True)
df['2'] = pd.to_datetime(df['2'], utc=True) 
df['3'] = df[['1', '2']].max(axis=1, skipna=True, numeric_only=False)
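The snippet above assumes a df with columns '1' and '2' already exists. A self-contained sketch of the same idea might look like the following; the string inputs are made up to mirror the dates in the question, and parsing them from text is an assumption about how the data arrives.

```python
import pandas as pd

# Hypothetical input: timestamps as strings, with one missing value.
df = pd.DataFrame({
    "1": ["2017-05-25 14:02:23+00:00", None],
    "2": ["2017-05-25 14:34:43+00:00", "2017-05-16 19:37:43+00:00"],
})

# Convert both columns to timezone-aware (UTC) datetimes.
df["1"] = pd.to_datetime(df["1"], utc=True)
df["2"] = pd.to_datetime(df["2"], utc=True)

# Row-wise maximum, skipping NaT and not restricting to numeric columns.
df["3"] = df[["1", "2"]].max(axis=1, skipna=True, numeric_only=False)
print(df["3"])
```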
