If-else function with several conditions in Python
I am trying to calculate the final revenue of my dataset. My dataset has several revenue streams, but under some conditions (explained below) the final revenue per client is calculated differently.
I am not very comfortable creating functions yet, so I'm not sure where I am making mistakes.
Dataframe example:
ClientId Sector Class Rev1 Rev2 Rev3
1 Sect_1 B 5 1 0
2 Sect_2 A 5.5 2 0
3 Sect_3 B 6 1.5 1
4 Sect_4 A 5 1 1.5
5 Sect_5 B 5 2 1
I want to create a 7th column, 'Final_Rev', given the following conditions:
- If 'Sector' = (Sect_3 or Sect_4) : 'Final_Rev' = Rev2 + Rev3
- OR if 'Class' = ("A") : 'Final_Rev' = Rev2 + Rev3
- Otherwise 'Final_Rev' = Rev1
The expected output is the following:
ClientId Sector Class Rev1 Rev2 Rev3 Final_Rev
1 Sect_1 B 5 1 0 5
2 Sect_2 A 5.5 2 0 2
3 Sect_3 B 6 1.5 1 2.5
4 Sect_4 A 5 1 1.5 2.5
5 Sect_5 B 5 2 1 5
I have tried to create the following function, but I'm not sure what I'm doing wrong:
def Final_Rev():
    if Sector in ['Sect_3','Sect_4'] or Class == 'A':
        return df['Rev2'] + df['Rev3']
    else:
        return df['Rev1']

df['Final_Rev'] = df.apply(Final_Rev, axis=1)
I have found an R solution that does what I want, but I don't know how to convert it to Python:
Final_Rev := ifelse(test = (Sector %in% c("Sect_3","Sect_4")|Class == "A"),
yes = Rev2 + Rev3,
no = Rev1
If someone could help me solve this, it would be really appreciated. Thanks.
You can use np.where:
import numpy as np

df['Final_Rev'] = np.where(df['Sector'].isin(['Sect_3','Sect_4']) | (df['Class'] == 'A'),
                           df['Rev2'] + df['Rev3'],
                           df['Rev1'])
Output:
ClientId Sector Class Rev1 Rev2 Rev3 Final_Rev
0 1 Sect_1 B 5.0 1.0 0.0 5.0
1 2 Sect_2 A 5.5 2.0 0.0 2.0
2 3 Sect_3 B 6.0 1.5 1.0 2.5
3 4 Sect_4 A 5.0 1.0 1.5 2.5
4 5 Sect_5 B 5.0 2.0 1.0 5.0
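For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same np.where call. The DataFrame is typed in by hand from the question's table, and use_rev23 is just an illustrative name for the combined condition, not something from the original post.

```python
import numpy as np
import pandas as pd

# Sample data copied by hand from the table in the question.
df = pd.DataFrame({
    'ClientId': [1, 2, 3, 4, 5],
    'Sector': ['Sect_1', 'Sect_2', 'Sect_3', 'Sect_4', 'Sect_5'],
    'Class': ['B', 'A', 'B', 'A', 'B'],
    'Rev1': [5, 5.5, 6, 5, 5],
    'Rev2': [1, 2, 1.5, 1, 2],
    'Rev3': [0, 0, 1, 1.5, 1],
})

# use_rev23 (illustrative name): True where Rev2 + Rev3 should be used.
# Each comparison is parenthesised because | binds tighter than ==.
use_rev23 = df['Sector'].isin(['Sect_3', 'Sect_4']) | (df['Class'] == 'A')

# np.where picks from the second argument where the condition is True,
# and from the third argument everywhere else.
df['Final_Rev'] = np.where(use_rev23, df['Rev2'] + df['Rev3'], df['Rev1'])
print(df)
```

Note that | is the element-wise "or" for boolean Series; the plain Python keyword or does not work on a whole column.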
apply takes a function as its first argument. That function receives each column, or each row when axis=1, as a pandas.Series, so your function needs to accept it as a parameter.
import pandas as pd

def foo(ds):
    if ds['A'] == 1:
        return 26
    elif ds['B'] == 4:
        return 27
    else:
        return 2*ds['A'] + 3*ds['B']

df = pd.DataFrame(columns=['A', 'B'], data = [[1,2],[3,4],[5,6]])
df['C'] = df.apply(foo, axis=1)
A B C
0 1 2 26
1 3 4 27
2 5 6 28
You can get your desired column with a single expression:
df['Final_Rev'] = df['Rev1'].where(
    ~(df['Sector'].isin({'Sect_3', 'Sect_4'}) | (df['Class'] == 'A')),
    df['Rev2'] + df['Rev3'])
This is the fastest approach because it doesn't require apply at all.
Generally speaking, and for readability, I would recommend building a mask for each sub-clause of your condition. In your case, there are two possible results (either Rev1 or Rev2 + Rev3). The first is the default value; the second applies under a single condition: Sector in {'Sect_3', 'Sect_4'} or Class == 'A'. Therefore:
mask = df['Sector'].isin({'Sect_3', 'Sect_4'}) | (df['Class'] == 'A')
df['Final_Rev'] = df['Rev1'] # default value
df.loc[mask, 'Final_Rev'] = df.loc[mask, 'Rev2'] + df.loc[mask, 'Rev3']
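The recommendation to build one mask per sub-clause could look like the sketch below. The names in_special_sector, is_class_a and use_rev23 are illustrative choices, and the DataFrame is retyped from the question's sample table so the block runs on its own.

```python
import pandas as pd

# Sample data copied by hand from the table in the question.
df = pd.DataFrame({
    'ClientId': [1, 2, 3, 4, 5],
    'Sector': ['Sect_1', 'Sect_2', 'Sect_3', 'Sect_4', 'Sect_5'],
    'Class': ['B', 'A', 'B', 'A', 'B'],
    'Rev1': [5, 5.5, 6, 5, 5],
    'Rev2': [1, 2, 1.5, 1, 2],
    'Rev3': [0, 0, 1, 1.5, 1],
})

# One named mask per sub-clause (names are illustrative).
in_special_sector = df['Sector'].isin(['Sect_3', 'Sect_4'])
is_class_a = df['Class'] == 'A'

# Combine the sub-clauses into the single condition from the question.
use_rev23 = in_special_sector | is_class_a

# Start from the default, then overwrite only the rows that match.
df['Final_Rev'] = df['Rev1']
df.loc[use_rev23, 'Final_Rev'] = (
    df.loc[use_rev23, 'Rev2'] + df.loc[use_rev23, 'Rev3']
)
print(df)
```

Keeping the sub-clauses separate makes it easy to inspect each one (for example with in_special_sector.sum()) when the result is not what you expect.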
If you insist on calling a Python function on every row, you can also do that, but it will be much slower:
def myfunc(r):
    if r.Sector == 'Sect_3' or r.Sector == 'Sect_4' or r.Class == 'A':
        return r.Rev2 + r.Rev3
    return r.Rev1

df.apply(myfunc, axis=1)
# out:
0 5.0
1 2.0
2 2.5
3 2.5
4 5.0
Performance:
Why do I say the .where() form is the fastest? Because it is fully vectorized, and there is no need for repeated calls into a Python function.
Here is a test:
import numpy as np
import pandas as pd

n = int(1e5)
df = pd.DataFrame({
    'ClientId': np.arange(n),
    'Sector': np.random.choice([f'Sect_{k}' for k in range(1, 8)], size=n),
    'Class': np.random.choice(list('ABCDEF'), size=n),
    'Rev1': np.random.randint(0, 20, size=n) * 0.5,
    'Rev2': np.random.randint(0, 20, size=n) * 0.5,
    'Rev3': np.random.randint(0, 20, size=n) * 0.5,
})
%timeit df['Rev1'].where(~(df['Sector'].isin({'Sect_3', 'Sect_4'}) | (df['Class'] == 'A')), df['Rev2'] + df['Rev3'])
10.9 ms ± 417 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.apply(myfunc, axis=1)
2.34 s ± 8.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The first form is over 200x faster!
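A speed comparison only means something if both forms return the same values. The sketch below checks that on random data. It assumes a smaller n, a seeded generator and a helper called row_rule, none of which come from the timing run above, and it does not measure time; actual timings will vary by machine and pandas version.

```python
import numpy as np
import pandas as pd

# Seeded generator so the check is repeatable; seed and size are arbitrary.
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    'Sector': rng.choice([f'Sect_{k}' for k in range(1, 8)], size=n),
    'Class': rng.choice(list('ABCDEF'), size=n),
    'Rev1': rng.integers(0, 20, size=n) * 0.5,
    'Rev2': rng.integers(0, 20, size=n) * 0.5,
    'Rev3': rng.integers(0, 20, size=n) * 0.5,
})

def row_rule(r):
    # Hypothetical helper: the question's rule applied to one row at a time.
    if r['Sector'] in ('Sect_3', 'Sect_4') or r['Class'] == 'A':
        return r['Rev2'] + r['Rev3']
    return r['Rev1']

# Vectorized form: keep Rev1 where the condition is False, else Rev2 + Rev3.
mask = df['Sector'].isin(['Sect_3', 'Sect_4']) | (df['Class'] == 'A')
vectorized = df['Rev1'].where(~mask, df['Rev2'] + df['Rev3'])

# Row-wise form: one Python function call per row.
row_wise = df.apply(row_rule, axis=1)

same = np.allclose(vectorized.to_numpy(), row_wise.to_numpy(dtype=float))
print(same)
```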