
Pandas: get position of last value based on condition for each column (efficiently)

I want to find, for each column of my dataframe, the row in which the value 1 last occurs. Given this last row index, I want to calculate the "recency" of the occurrence. Like so:

>> df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df
   a  b  c  d
0  0  1  1  0
1  0  1  0  0
2  1  1  0  0
3  0  1  0  0
4  0  1  1  0

Desired result:

>> calculate_recency_vector(df)
[3,1,1,None]

The desired result shows, for each column, how many rows ago the value 1 last appeared. E.g. for column a the value 1 last appears in the 3rd-last row, hence the recency of 3 in the result vector. Any ideas how to implement this?

Edit: to avoid confusion, I changed the desired output for the last column from 0 to None. This column has no recency because the value 1 does not occur at all.

Edit II: Thanks for the great answers. I have to calculate this recency vector approx. 150k times on dataframes of shape (42, 250). A more efficient solution would be much appreciated.

A loop-less solution which is faster and cleaner:

import pandas as pd

def calculate_recency_for_one_column(column: pd.Series) -> int:
    # Keep only the rows whose value is non-zero (truthy).
    non_zero_values_of_col = column[column.astype(bool)]
    if non_zero_values_of_col.empty:
        # No match in this column: this answer returns 0 rather than None.
        return 0
    # Assumes the default RangeIndex, so the index label equals the row position.
    return len(column) - non_zero_values_of_col.index[-1]

df = pd.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})

df.apply(calculate_recency_for_one_column, axis=0)

a    3
b    1
c    1
d    0
dtype: int64

Side note: DataFrame.apply() is slow (see the SO explanation). Faster alternatives exist, such as np.where or apply(..., raw=True). See this question for details.
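To illustrate the kind of vectorized alternative the side note hints at, here is a minimal sketch that avoids apply entirely. It is not taken from any of the answers: the function name calculate_recency_vector_np and the condition parameter are made up for illustration. It reverses the boolean mask along the rows and uses argmax to find the first match from the bottom. It assumes the frame has at least one row.

```python
import numpy as np
import pandas as pd

def calculate_recency_vector_np(df: pd.DataFrame, condition=1) -> list:
    # Hypothetical helper, not from the original answers.
    # Boolean mask of shape (n_rows, n_cols): True where the value matches.
    mask = df.to_numpy() == condition
    # Reverse the rows; argmax then gives, per column, the offset from the
    # bottom of the first True, which is the last match in original order.
    offset_from_bottom = mask[::-1].argmax(axis=0)
    # argmax also returns 0 for all-False columns, so track real matches.
    has_match = mask.any(axis=0)
    # A match in the final row has recency 1, hence the +1.
    return [int(off) + 1 if ok else None
            for off, ok in zip(offset_from_bottom, has_match)]

df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})
print(calculate_recency_vector_np(df))  # [3, 1, 1, None]
```

Because it works on positions in the underlying array, this sketch does not depend on the dataframe's index labels. Whether it is actually faster for a given workload would need to be measured.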

With this example dataframe, you can define a function as follows:

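import pandas as pd

def calculate_recency_vector(df: pd.DataFrame, condition: int) -> list:
    recency_vector = []

    for col in df.columns:
        values = df[col].to_list()
        # Position of the last match; stays None if there is no match.
        # (Starting from 0 would wrongly report None for a match in row 0 only.)
        last = None
        for i, y in enumerate(values):
            if y == condition:
                last = i

        # Rows counted from the end; a match in the final row has recency 1.
        recency = None if last is None else len(values) - last
        recency_vector.append(recency)

    return recency_vector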

Running the function returns:

calculate_recency_vector(df, 1)
[3, 1, 1, None]

One direct approach is to loop over each column in the DataFrame and, within that loop, iterate over each row of the column. For each row, check whether the value is 1. If it is, update a variable to store len(df[column]) - index. After the inner loop finishes, the stored value is the recency for that column. If 1 never appears in the column, the result is None.

import pandas

def calculate_recency_vector(df):
    recency_vector = []
    for column in df:
        last_occurrence = None  # stays None if 1 never appears
        # items() replaces iteritems(), which was removed in pandas 2.0.
        # Index labels are used as positions, so a default RangeIndex is assumed.
        for index, value in df[column].items():
            if value == 1:
                last_occurrence = len(df[column]) - index
        recency_vector.append(last_occurrence)
    return recency_vector


df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
print(calculate_recency_vector(df))

This

import pandas
df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda x: ([df.shape[0] - i for i, v in x.items() if v == 1] or [None])[-1], axis=0)

produces the desired output as a pd.Series. The only difference is that the result is float and None is replaced by pandas NaN. You can then select the desired column.
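If a plain list with None is needed rather than a Series with NaN, the missing entries can be mapped back afterwards. A minimal sketch, reusing the one-liner from this answer; the variable names result and recency are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})

# Same expression as above: per column, the recency of the last 1, else None.
result = df.apply(
    lambda x: ([df.shape[0] - i for i, v in x.items() if v == 1] or [None])[-1],
    axis=0,
)

# pd.isna() is True for both None and NaN, so this works whichever one
# pandas stored; the remaining values are cast back to int.
recency = [None if pd.isna(v) else int(v) for v in result]
print(recency)  # [3, 1, 1, None]
```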
