Compare three columns and choose the highest
I have a dataset that looks like the image below,
and my goal is to compare the last three columns and pick the highest one for each row.
I have four new variables: empty = 0, cancel = 0, release = 0, undetermined = 0.
For index 0, cancelCount is the highest, so cancel += 1. undetermined is incremented only if the three values are the same.
Here is my failed code sample:
empty = 0
cancel = 0
release = 0
undetermined = 0
if (df["emptyCount"] > df["cancelcount"]) & (df["emptyCount"] > df["releaseCount"]):
    empty += 1
elif (df["cancelcount"] > df["emptyCount"]) & (df["cancelcount"] > df["releaseCount"]):
    cancel += 1
elif (df["releasecount"] > df["emptyCount"]) & (df["releasecount"] > df["emptyCount"]):
    release += 1
else:
    undetermined += 1
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
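The error comes from the `if`: each comparison yields a boolean Series with one value per row, and Python cannot reduce a whole Series to a single True/False. As a rough sketch of how the same if/elif/else chain could be evaluated per row, the snippet below uses `np.select`. The sample data is an assumption (the original image is not available), and the column names are assumed to be `emptyCount`, `cancelCount` and `releaseCount`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row has three equal values.
df = pd.DataFrame({
    "emptyCount": [2, 4, 5, 7, 3, 1],
    "cancelCount": [3, 7, 8, 11, 2, 1],
    "releaseCount": [2, 0, 0, 5, 3, 1],
})

e, c, r = df["emptyCount"], df["cancelCount"], df["releaseCount"]

# One boolean Series per branch of the original if/elif chain.
conditions = [
    (e > c) & (e > r),
    (c > e) & (c > r),
    (r > e) & (r > c),
]
labels = ["empty", "cancel", "release"]

# Rows matching no condition (any tie for the maximum) fall to the default.
winner = pd.Series(
    np.select(conditions, labels, default="undetermined"),
    index=df.index,
)
counts = winner.value_counts()
print(counts)
```

With strict `>` comparisons, every row that has a tie for the maximum ends up as undetermined, not only rows where all three values are equal.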
In general, you should avoid looping. Here's an example of vectorized code that does what you need:
# data of interest
s = df[['emptyCount', 'cancelCount', 'releaseCount']]
# maximum of each row
max_vals = s.max(axis=1)
# 1 where a value equals its row maximum, 0 elsewhere
equal_max = s.eq(max_vals, axis='rows').astype(int)
# True for rows that have a single maximum:
single_max = equal_max.sum(axis=1) == 1
# count, per column, the rows where that column is the single maximum:
equal_max.mul(single_max, axis='rows').sum()
The output would be a Series that looks like this:
emptyCount count1
cancelCount count2
releaseCount count3
dtype: int64
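To make the placeholder counts concrete, here is a small self-contained run of the same idea. The sample data is an assumption, and the `undetermined` line is an addition that counts every row without a single maximum (so any tie for the maximum counts, not only rows where all three values are equal).

```python
import pandas as pd

# Hypothetical sample data standing in for the real dataset.
df = pd.DataFrame({
    "emptyCount": [2, 4, 5, 7, 3],
    "cancelCount": [3, 7, 8, 11, 2],
    "releaseCount": [2, 0, 0, 5, 3],
})

s = df[["emptyCount", "cancelCount", "releaseCount"]]

# Row-wise maximum, then a 0/1 mask of the cells that reach it.
max_vals = s.max(axis=1)
equal_max = s.eq(max_vals, axis=0).astype(int)

# A row is decided only if exactly one column holds the maximum.
single_max = equal_max.sum(axis=1) == 1

# Zero out tied rows, then count the wins per column.
counts = equal_max.mul(single_max.astype(int), axis=0).sum()
undetermined = int((~single_max).sum())

print(counts)
print("undetermined:", undetermined)
```

In this data the last row ties between `emptyCount` and `releaseCount`, so it contributes to `undetermined` and to none of the three columns.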
First we find the undetermined rows, i.e. those where all three counts are equal:
equal = (df['emptyCount'] == df['cancelcount']) & (df['cancelcount'] == df['releaseCount'])
Then we find the max column of the determined rows:
max_arg = df.loc[~equal, ['emptyCount', 'cancelcount', 'releaseCount']].idxmax(axis=1)
And count them:
undetermined = equal.sum()
empty = (max_arg == 'emptyCount').sum()
cancel = (max_arg == 'cancelcount').sum()
release = (max_arg == 'releaseCount').sum()
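For reference, a self-contained sketch of this `idxmax` approach on made-up data. The data and the `cancelCount` spelling are assumptions, and "undetermined" is taken here to mean that all three counts are equal, as the question describes. Note that `idxmax` returns the first column when two columns tie for the maximum.

```python
import pandas as pd

# Hypothetical sample data; the last row has three equal values.
df = pd.DataFrame({
    "emptyCount": [2, 4, 5, 7, 3, 1],
    "cancelCount": [3, 7, 8, 11, 2, 1],
    "releaseCount": [2, 0, 0, 5, 3, 1],
})
cols = ["emptyCount", "cancelCount", "releaseCount"]

# Undetermined rows: all three counts are equal.
equal = (df["emptyCount"] == df["cancelCount"]) & (
    df["cancelCount"] == df["releaseCount"]
)

# Name of the column holding the maximum, for the remaining rows only.
max_arg = df.loc[~equal, cols].idxmax(axis=1)

undetermined = int(equal.sum())
empty = int((max_arg == "emptyCount").sum())
cancel = int((max_arg == "cancelCount").sum())
release = int((max_arg == "releaseCount").sum())

print(empty, cancel, release, undetermined)
```

The row (3, 2, 3) is credited to `emptyCount` because of the first-column tie-breaking; if two-way ties should also count as undetermined, the mask would need to be widened accordingly.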
import pandas as pd
import numpy as np
class thing(object):
    # simple mutable counter
    def __init__(self):
        self.value = 0
empty, cancel, release, undetermined = [thing() for i in range(4)]
# map each column position to its counter
dictt = {0: empty, 1: cancel, 2: release, 3: undetermined}
df = pd.DataFrame({
    'emptyCount': [2,4,5,7,3],
    'cancelCount': [3,7,8,11,2],
    'releaseCount': [2,0,0,5,3],
})
# walk over the last three rows (positions -3, -2, -1)
for i in range(1,4):
    series = df.iloc[-4+i]
    for j in range(len(series)):
        # every column that reaches the row maximum gets a count
        if series.iloc[j] == series.max():
            dictt[j].value += 1
cancel.value
A small script to get the maximum values:
import numpy as np
emptyCount = [2,4,5,7,3]
cancelCount = [3,7,8,11,2]
releaseCount = [2,0,0,5,3]
# Here we use np.where to count instances where there is more than one index with the max value.
# np.where returns a tuple, so we flatten it using "for n in m"
count = [n for z in zip(emptyCount, cancelCount, releaseCount) for m in np.where(np.array(z) == max(z)) for n in m]
empty = count.count(0) # 1
cancel = count.count(1) # 4
release = count.count(2) # 1