Grouping Pandas dataframe based on conditions?

I am following the suggestions in "pandas create new column based on values from other columns" but still getting an error. Basically, my Pandas dataframe has many columns and I want to group the dataframe based on a new categorical column whose value depends on two existing columns (AMP, Time).

df
df['Time'] = pd.to_datetime(df['Time']) 
#making sure Time column read from the csv file is time object

import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)

def f(row):

    if (row['AMP'] > 100) & (row['Time'] > day_1):
        val = 'new_positives'

    elif (row['AMP'] > 100) & (day_2 <= row['Time'] <= day_1):
        val = 'rec_positives'

    elif (row['AMP'] > 100 & row['Time'] < day_2):
        val = 'old_positives'

    else:
        val = 'old_negatives'

    return val

df['GRP'] = df.apply(f, axis=1) #this gives the following error:
TypeError: ("Cannot compare type 'Timestamp' with type 'date'", 'occurred at index 0')

df[(df['AMP'] > 100) & (df['Time'] > day_1)]  #this works fine

df[(df['AMP'] > 100) & (day_2 <= df['Time'] <= day_1)]  #this works fine

df[(df['AMP'] > 100) & (df['Time'] < day_2)]  #this works fine


#df = df.groupby('GRP')  

I am able to select the proper sub-dataframes based on the conditions specified above, but when I apply the above function to each row, I get the error. What is the correct approach to group the dataframe based on the conditions listed?

EDIT:

Unfortunately, I cannot provide a sample of my dataframe. However, here is a simple dataframe that gives an error of the same type:

import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a':np.arange(10),
   'b':np.random.rand(10)})

def f1(row):
    if row['a'] < 5 & row['b'] < 0.5:
        value = 'less'
    elif row['a'] < 5 & row['b'] > 0.5:
        value = 'more'
    else:
        value = 'same'
    return value

mydf['GRP'] = mydf.apply(f1, axis=1)

TypeError: ("unsupported operand type(s) for &: 'int' and 'float'", 'occurred at index 0')

EDIT 2: As suggested below, enclosing each comparison in parentheses did the trick for the cooked-up example. This problem is solved.

However, I am still getting the same error in my real example. By the way, if I were to use the column 'AMP' with perhaps another column in my table, then everything works and I am able to create df['GRP'] by applying the function f to each row. This shows the problem is related to using df['Time']. But then why am I able to select df[(df['AMP'] > 100) & (df['Time'] > day_1)]? Why would this work in this context, but not when the condition appears in a function?

Based on your error message and example, there are two things to fix. One is to adjust parentheses for operator precedence in your final elif statement. The other is to avoid mixing datetime.date and Timestamp objects.

Fix 1: Change this:

elif (row['AMP'] > 100 & row['Time'] < day_2):

to this:

elif (row['AMP'] > 100) & (row['Time'] < day_2):

These two lines are different because the bitwise & operator takes precedence over the < and > comparison operators, so Python attempts to evaluate 100 & row['Time'] first. A full list of Python operator precedence is here: https://docs.python.org/3/reference/expressions.html#operator-precedence
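The precedence issue can be reproduced with plain scalars. This is a minimal sketch; the values of a and b are illustrative stand-ins for row['a'] and row['b'] and are not taken from the question.

```python
a, b = 3, 0.25  # illustrative scalars standing in for row['a'] and row['b']

# Without parentheses, & binds tighter than <, so this parses as
# a < (5 & b) < 0.5 and fails on the int & float operation.
try:
    a < 5 & b < 0.5
    unparenthesized_error = None
except TypeError as exc:
    unparenthesized_error = str(exc)

# With parentheses, each comparison is evaluated first and & combines two bools.
parenthesized = (a < 5) & (b < 0.5)

print(unparenthesized_error)
print(parenthesized)
```

The first expression never reaches the comparisons, which is why the error message mentions & rather than < or >.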

Fix 2: Change these 3 lines:

import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)

to these 2 lines:

day_1 = pd.to_datetime('today')
day_2 = day_1 - pd.DateOffset(days=1)
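As a rough sketch of why this helps: once both boundaries are pandas Timestamp objects, the scalar comparisons inside a row-wise function compare like with like. The variant below builds day_1 from dt.date.today() via pd.Timestamp, which is an assumption on my part (it gives midnight at the start of today); row_time is a hypothetical value standing in for row['Time'].

```python
import datetime as dt
import pandas as pd

# Midnight today as a pandas Timestamp, and the same moment one day earlier.
day_1 = pd.Timestamp(dt.date.today())
day_2 = day_1 - pd.DateOffset(days=1)

# Hypothetical row value: a Timestamp like the ones pd.to_datetime produces.
row_time = pd.Timestamp("2018-10-17")

# Timestamp-to-Timestamp comparisons work on scalars, including chained ones.
is_old = row_time < day_2
is_recent = day_2 <= row_time <= day_1
print(type(day_1).__name__, is_old, is_recent)
```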

Some parentheses need to be added in the if-statements:

import numpy as np
import pandas as pd

mydf = pd.DataFrame({'a':np.arange(10),
   'b':np.random.rand(10)})

def f1(row):
    if (row['a'] < 5) & (row['b'] < 0.5):
        value = 'less'
    elif (row['a'] < 5) & (row['b'] > 0.5):
        value = 'more'
    else:
        value = 'same'
    return value

mydf['GRP'] = mydf.apply(f1, axis=1)
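Because apply with axis=1 hands the function one row at a time, row['a'] and row['b'] are scalars, so Python's and is another option: it binds more loosely than the comparison operators. The sketch below is a variation on f1, not the code from the answer; the name f1_and and the fixed values (used instead of random numbers so the result is reproducible) are made up for illustration.

```python
import pandas as pd

# Fixed values instead of np.random.rand so the result is reproducible.
mydf = pd.DataFrame({'a': [0, 1, 7], 'b': [0.2, 0.9, 0.4]})

def f1_and(row):
    # row['a'] and row['b'] are scalars here, so `and` is valid and
    # binds more loosely than < and >; no extra parentheses are needed.
    if row['a'] < 5 and row['b'] < 0.5:
        return 'less'
    elif row['a'] < 5 and row['b'] > 0.5:
        return 'more'
    return 'same'

mydf['GRP'] = mydf.apply(f1_and, axis=1)
print(mydf['GRP'].tolist())
```

Note that and only works for scalars like these; when combining whole columns, as in a boolean mask, & with parentheses is still required.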

If you don't need to use a custom function, then you can use multiple masks (somewhat similar to this SO post).

For the Time column, I used this code. It may be that you were trying to compare Time column values that did not have the required dtype (this is my guess).

import datetime as dt
mydf['Time'] = pd.date_range(start='10/14/2018', end=dt.date.today())
day_1 = pd.to_datetime(dt.date.today())
day_2 = day_1 - pd.DateOffset(days = 1)

Here is the raw data:

mydf

   a         b       Time
0  0  0.550149 2018-10-14
1  1  0.889209 2018-10-15
2  2  0.845740 2018-10-16
3  3  0.340310 2018-10-17
4  4  0.613575 2018-10-18
5  5  0.229802 2018-10-19
6  6  0.013724 2018-10-20
7  7  0.810413 2018-10-21
8  8  0.897373 2018-10-22
9  9  0.175050 2018-10-23

One approach involves using masks for columns:

# Append new column
mydf['GRP'] = 'same'
# Use masks to change values in new column
mydf.loc[(mydf['a'] < 5) & (mydf['b'] < 0.5) & (mydf['Time'] < day_2), 'GRP'] = 'less'
mydf.loc[(mydf['a'] < 5) & (mydf['b'] > 0.5) & (mydf['Time'] > day_1), 'GRP'] = 'more'
mydf

   a         b       Time   GRP
0  0  0.550149 2018-10-14  same
1  1  0.889209 2018-10-15  same
2  2  0.845740 2018-10-16  same
3  3  0.340310 2018-10-17  less
4  4  0.613575 2018-10-18  same
5  5  0.229802 2018-10-19  same
6  6  0.013724 2018-10-20  same
7  7  0.810413 2018-10-21  same
8  8  0.897373 2018-10-22  same
9  9  0.175050 2018-10-23  same
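The same mask idea can be applied to the four categories from the question and then fed to groupby. The following is only a sketch using numpy.select, which is a technique I am swapping in rather than something from the answer above. The column names AMP and Time follow the question, but the data values and the fixed day_1 date are hypothetical, chosen so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the AMP and Time column names follow the question,
# the values and the fixed reference date are made up for illustration.
day_1 = pd.Timestamp("2018-10-23")
day_2 = day_1 - pd.Timedelta(days=1)
df = pd.DataFrame({
    "AMP": [150, 120, 130, 50],
    "Time": pd.to_datetime(["2018-10-24", "2018-10-22", "2018-10-01", "2018-10-24"]),
})

positive = df["AMP"] > 100
conditions = [
    positive & (df["Time"] > day_1),
    positive & df["Time"].between(day_2, day_1),  # inclusive on both ends
    positive & (df["Time"] < day_2),
]
labels = ["new_positives", "rec_positives", "old_positives"]

# Rows matching none of the conditions get the default label.
df["GRP"] = np.select(conditions, labels, default="old_negatives")

group_sizes = df.groupby("GRP").size()
print(df)
print(group_sizes)
```

Each condition is a whole-column boolean mask, so no row-wise apply is involved and every comparison is between Timestamp values.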

Another approach is to set a, b and Time as a multi-index and use index-based masks to set values:

mydf.set_index(['a','b','Time'], inplace=True)

# Get Index level values
a = mydf.index.get_level_values('a')
b = mydf.index.get_level_values('b')
t = mydf.index.get_level_values('Time')

# Apply index-based masks
mydf['GRP'] = 'same'
mydf.loc[(a < 5) & (b < 0.5) & (t < day_2), 'GRP'] = 'less'
mydf.loc[(a < 5) & (b > 0.5) & (t > day_1), 'GRP'] = 'more'
mydf.reset_index(drop=False, inplace=True)
mydf

   a         b       Time   GRP
0  0  0.550149 2018-10-14  same
1  1  0.889209 2018-10-15  same
2  2  0.845740 2018-10-16  same
3  3  0.340310 2018-10-17  less
4  4  0.613575 2018-10-18  same
5  5  0.229802 2018-10-19  same
6  6  0.013724 2018-10-20  same
7  7  0.810413 2018-10-21  same
8  8  0.897373 2018-10-22  same
9  9  0.175050 2018-10-23  same

Sources for filtering by datetime and creating a range of dates.

You have an excellent example here. It is very useful, and you could apply filters after the groupby. It is a way of doing this without using masks.

def get_letter_type(letter):
   if letter.lower() in 'aeiou':
       return 'vowel'
   else:
       return 'consonant'


grouped = df.groupby(get_letter_type, axis=1)

https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html
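The excerpt above does not define df, so here is a small self-contained sketch of grouping by a function. It groups rows by their index label rather than columns, which is my own variation so that it does not depend on the axis argument; the frame and its values are hypothetical.

```python
import pandas as pd

def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

# Hypothetical frame whose row index holds single letters.
df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=['a', 'b', 'e', 'z'])

# A function passed to groupby is called on each index label,
# and rows with the same return value end up in the same group.
totals = df.groupby(get_letter_type)['value'].sum()
print(totals)
```

For the question's case, where the group depends on column values rather than on index labels, grouping by a precomputed column such as GRP is probably the more direct route.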
