What is the fastest way to compare column values in pandas (Python)?

Question

I have the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, np.nan, 1], [1, np.nan, 1, 1]]),
                    columns=['t', 't_1', 't_2', 't_3'])

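The real data has about 10 million rows. For each row, I need a fast way to find the last column in the unbroken run of non-null values that starts at t; every value from the first NaN onward should become NaN. For this df, the result would be: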

df_result = pd.DataFrame(np.array([[1, 1, 1, 1], [1, 1, np.nan, np.nan], [1, np.nan, np.nan, np.nan]]),
                    columns=['t', 't_1', 't_2', 't_3'])

Currently I do this with the following function, applied row by row through a lambda, but it is too slow:

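def second_to_last_null(*args):
    # Return NaN if any of the values is NaN; otherwise return the last value.
    for value in args:
        if np.isnan(value):
            return np.nan
    # Only reached when no NaN was found in any position.
    return args[-1]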


df_result['t'] = df['t']
df_result['t_1_consecutive'] = df[['t', 't_1']].apply(lambda x: second_to_last_null(x.t, x.t_1), axis=1)
df_result['t_2_consecutive'] = df[['t', 't_1', 't_2']].apply(lambda x: second_to_last_null(x.t, x.t_1, x.t_2), axis=1)
df_result['t_3_consecutive'] = df[['t', 't_1', 't_2', 't_3']].apply(lambda x: second_to_last_null(x.t, x.t_1, x.t_2, x.t_3), axis=1)

Can anyone suggest the fastest way to do this in pandas or NumPy, along with a brief technical explanation of why it is better than my approach?

Answer 1

Try cumsum on the result of isna, then use mask:

df_result = df.mask(df.isna().cumsum(axis=1) >= 1)

Output:

     t  t_1  t_2  t_3
0  1.0  1.0  1.0  1.0
1  1.0  1.0  NaN  NaN
2  1.0  NaN  NaN  NaN

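Explanation: df.isna() marks each NaN as True and everything else as False. Taking cumsum(axis=1) then gives the running count of NaNs seen so far along each row. Finally, any position where the cumsum is >= 1 sits at or after the first NaN in its row, so mask replaces it with NaN.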
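The three steps in the explanation can be written out one at a time to see the intermediate results. This is a minimal sketch on the question's sample data. The variable names is_nan, nan_count, seen_nan and df_numpy are made up for illustration, and the NumPy-only variant at the end is an assumed alternative, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 1, 1, 1], [1, 1, np.nan, 1], [1, np.nan, 1, 1]]),
    columns=['t', 't_1', 't_2', 't_3'],
)

# Step 1: boolean frame, True where the value is NaN.
is_nan = df.isna()

# Step 2: running count of NaNs from left to right within each row.
nan_count = is_nan.cumsum(axis=1)

# Step 3: blank out every position at or after the first NaN in its row.
df_result = df.mask(nan_count >= 1)

# Assumed NumPy-only variant (not from the answer): a running logical OR
# along each row flags the same positions without counting the NaNs.
values = df.to_numpy()
seen_nan = np.logical_or.accumulate(np.isnan(values), axis=1)
df_numpy = pd.DataFrame(
    np.where(seen_nan, np.nan, values), index=df.index, columns=df.columns
)

print(nan_count)
print(df_result)
```

Both versions work on the whole array in a few operations, whereas apply(axis=1) makes one Python-level function call per row. That difference is the usual reason this kind of rewrite is much faster on millions of rows, though the actual speedup should be measured on the real data.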

Solution 1
2, accepted, 2022-06-15 15:56:44
