
Pandas groupby() and agg() ignore errors

UPDATED for completeness:

import pandas as pd

dates = pd.to_datetime(['2017-10-01','2017-10-02','2017-10-03']).tolist()

df = pd.DataFrame({ 
            'day_of_week':['m','t','w'],
            'alpha':[1,2,3],
            'bravo':[4,5,6],
            'charlie':[7,8,9],
            'dates':dates
            })

agg_dik = {'alpha': sum,
           'bravo': sum,
           'charlie': max,
           'dates': sum}

df = df.groupby('day_of_week').agg(agg_dik).reset_index(drop = True)

This throws an error on the sum of the datetimes. I could avoid that by hand if the dataframe truly had only five columns, but I have dataframes with hundreds of columns and often build the aggregation dictionary with a comprehension like:

agg_dik = { c : max if 'e' in c else sum for c in cols }

However, when groupby().agg() hits a series where sum is not allowed, it errors out.

So my question: is there a way to achieve the results I'm looking for, but have pandas either drop the erroring columns or replace them with NaN and continue on?

I've looked at a few other questions (like this one), but they don't fully answer my question.

There are two issues at hand:

  1. Your dictionary of functions may contain columns that are not in the dataframe you're working with. In cases like that, you need to keep only the entries whose keys match columns present in the dataframe.

  2. Some of your functions throw errors/exceptions that need to be caught. Otherwise, the last line of your code will not work.

The following is a solution that should handle these two cases:

import pandas as pd
import numpy as np

dates = pd.to_datetime(['2017-10-01','2017-10-02','2017-10-03'])

df = pd.DataFrame({ 
            'day_of_week': ['m','t','w'],
            'alpha': [1,2,3],
            'bravo': [4,5,6],
            'charlie': [7,8,9],
            'dates':dates
            })

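# Sum that returns NaN when the dtype does not support it (e.g. datetimes)
def sum_(x):
    try:
        return np.sum(x)
    except Exception:
        return np.nan

# Max that returns NaN instead of raising
def max_(x):
    try:
        return np.max(x)
    except Exception:
        return np.nan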

agg_dik = {'alpha': sum_,
           'bravo': sum_,
           'charlie': max_,
           'delta': max_}

df = df.groupby('day_of_week').agg({k:v for k,v in agg_dik.items() if k in df}).reset_index(drop = True)
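
With hundreds of columns, writing one wrapper per function may get tedious. As a rough sketch of the same idea, the two steps above (filter the keys, catch the exceptions) could be generalized with a small factory. The name `nan_on_error` is a hypothetical helper introduced here for illustration, not part of pandas, and `'dates'` is mapped to a sum on purpose so that the failing case is exercised:

```python
import numpy as np
import pandas as pd


def nan_on_error(func):
    """Hypothetical helper: wrap an aggregation so any exception yields NaN."""
    def wrapped(values):
        try:
            return func(values)
        except Exception:
            return np.nan
    return wrapped


df = pd.DataFrame({
    'day_of_week': ['m', 't', 'w'],
    'alpha': [1, 2, 3],
    'bravo': [4, 5, 6],
    'charlie': [7, 8, 9],
    'dates': pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-03']),
})

# Raw aggregation spec: 'dates' cannot be summed, 'delta' is not a column of df
raw_spec = {'alpha': np.sum,
            'bravo': np.sum,
            'charlie': np.max,
            'dates': np.sum,
            'delta': np.max}

# Keep only keys that are real columns, and wrap every function
agg_dik = {k: nan_on_error(v) for k, v in raw_spec.items() if k in df.columns}

result = df.groupby('day_of_week').agg(agg_dik).reset_index(drop=True)
print(result)
```

Here the `dates` column should come back filled with NaN rather than stopping the aggregation. If dropping such columns is preferred, something like `result.dropna(axis=1, how='all')` afterwards is one option.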

I hope this helps.
