Pandas: get the position of the last value matching a condition for each column (efficiently)

Question

I want to find out, for each column of my dataframe, the row in which the value 1 occurs last. Given this last row index, I want to calculate the "recency" of the occurrence. Like so:

>> import pandas
>> df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df
   a  b  c  d
0  0  1  1  0
1  0  1  0  0
2  1  1  0  0
3  0  1  0  0
4  0  1  1  0

Desired result:

>> calculate_recency_vector(df)
[3,1,1,None]

The desired result shows, for each column, "how many rows ago" the value 1 appeared for the last time. E.g. for column a, the value 1 last appears in the 3rd-last row, hence the recency of 3 in the result vector. Any ideas how to implement this?

Edit: to avoid confusion, I changed the desired output for the last column from 0 to None. This column has no recency because the value 1 does not occur at all.

Edit II: Thanks for the great answers. I have to calculate this recency vector approx. 150k times on dataframes shaped (42.250). A more efficient solution would be much appreciated.

Answer 1

A loop-less solution which is faster and cleaner:

>> import pandas as pd

>> def calculate_recency_for_one_column(column: pd.Series) -> int:
>>     # Keep only the non-zero entries of the column
>>     non_zero_values_of_col = column[column.astype(bool)]
>>     if non_zero_values_of_col.empty:
>>         return 0  # no non-zero value in this column
>>     # Assumes the default 0..n-1 index, so the last label is a row position
>>     return len(column) - non_zero_values_of_col.index[-1]

>> df = pd.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})

>> df.apply(calculate_recency_for_one_column, axis=0)

a    3
b    1
c    1
d    0
dtype: int64

Sidenote: DataFrame.apply() is slow (SO explanation). There are faster solutions, such as using np.where or apply(..., raw=True). See this question for details.
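As a rough illustration of the faster, apply-free direction mentioned in the sidenote, here is a minimal NumPy sketch. It is not taken from the answer above: the function name reuses the one from the question, the `condition` parameter is an assumption added for illustration, and the sketch assumes the dataframe has at least one row.

```python
import numpy as np
import pandas as pd


def calculate_recency_vector(df: pd.DataFrame, condition=1) -> list:
    """Per column: how many rows ago `condition` last occurred (None if never)."""
    # Boolean array of shape (rows, columns): True where the value matches.
    mask = df.to_numpy() == condition
    # argmax on the row-reversed mask gives, per column, the 0-based distance
    # from the bottom row to the last match. It also returns 0 for columns
    # without any match, so those are filtered out with has_match below.
    rows_from_end = mask[::-1].argmax(axis=0)
    has_match = mask.any(axis=0)
    return [int(r) + 1 if found else None
            for r, found in zip(rows_from_end, has_match)]


df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})
print(calculate_recency_vector(df))  # [3, 1, 1, None]
```

Because everything except the final list construction runs on the whole array at once, this avoids the per-column Python call that apply makes. Whether it is fast enough for 150k calls would need to be measured on the real data.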

Answer 2

With this example dataframe, you can define a function as follows:

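import pandas as pd

def calculate_recency_vector(df: pd.DataFrame, condition: int) -> list:
    recency_vector = []

    for col in df.columns:
        values = df[col].to_list()
        # Position of the last match; None (not 0) so that a match
        # in the very first row is not mistaken for "no match"
        last = None
        for i, y in enumerate(values):
            if y == condition:
                last = i

        # "Rows ago" counted from the end; None when the value never occurs
        recency = None if last is None else len(values) - last
        recency_vector.append(recency)

    return recency_vector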

Running the function returns:

calculate_recency_vector(df, 1)
[3, 1, 1, None]

Answer 3

One direct approach to implementing this function is to loop over each column in the DataFrame and, within that loop, use another loop over each row in the column. For each row, check whether the value is 1. If it is, update a variable to store len(df[column]) - index. After the inner loop finishes, use the stored value as the recency for that column. If 1 never appears in the column, the recency is None.

import pandas

def calculate_recency_vector(df):
    recency_vector = []
    for column in df:
        last_occurrence = None  # stays None if 1 never appears
        # Series.iteritems() was removed in pandas 2.0; items() replaces it
        for index, value in df[column].items():
            if value == 1:
                # Assumes the default 0..n-1 index, so the label is the row position
                last_occurrence = len(df[column]) - index
        recency_vector.append(last_occurrence)
    return recency_vector


df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
print(calculate_recency_vector(df))
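The loop above subtracts the index label from the column length, which only works when the dataframe has the default 0..n-1 index. A possible variant, sketched here under the assumption that only row positions matter, walks each column from the bottom and stops at the first match. The string index in the example is hypothetical and only there to show that labels are ignored.

```python
import pandas as pd


def calculate_recency_vector(df):
    recency_vector = []
    for column in df:
        values = df[column].to_numpy()
        recency = None  # stays None if 1 never appears
        # Walk from the bottom row upwards; rows_ago starts at 1 for the last row.
        for rows_ago, value in enumerate(values[::-1], start=1):
            if value == 1:
                recency = rows_ago
                break  # the first hit from the bottom is the last occurrence
        recency_vector.append(recency)
    return recency_vector


df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]},
                  index=list("vwxyz"))  # non-default index, for illustration
print(calculate_recency_vector(df))  # [3, 1, 1, None]
```

Stopping at the first match from the bottom also avoids scanning the whole column when the value occurred recently.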

Answer 4

This

import pandas
df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda x : ([df.shape[0] - i for i ,v in x.items() if v==1] or [None])[-1], axis=0)

produces the desired output as a pd.Series, with the only difference that the result is float and None is replaced by pandas NaN. You can then take the desired column.
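If the plain list with None from the question is needed instead of a Series with NaN, one way to convert it is sketched below. The names `recency` and `recency_list` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})

# Same expression as in the answer, stored in a variable.
recency = df.apply(
    lambda x: ([df.shape[0] - i for i, v in x.items() if v == 1] or [None])[-1],
    axis=0,
)

# pd.isna is True for both None and NaN, so missing entries become None
# and everything else is cast back to a plain int.
recency_list = [None if pd.isna(v) else int(v) for v in recency]
print(recency_list)  # [3, 1, 1, None]
```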

Answer 1: score 1, accepted, 2022-12-28 12:44:45
Answer 2: score 0, 2022-12-25 12:10:25
Answer 3: score 0, 2022-12-25 12:14:57
Answer 4: score 0, 2022-12-25 17:07:33
