Pandas: How to remove leading missing values from all columns of a pandas dataframe?

Question

With a pandas dataframe of the form:

     A     B     C
ID                
1   10   NaN   NaN
2   20   NaN   NaN
3   28  10.0   NaN
4   32  18.0  10.0
5   34  22.0  16.0
6   34  24.0  20.0
7   34  26.0  21.0
8   34  26.0  22.0

How can I remove a varying number of initial missing values? Initially, I'd like to forward fill the last values of the "new" columns so I'll end up with this:

    A     B     C
0  10  10.0  10.0
1  20  18.0  16.0
2  28  22.0  20.0
3  32  24.0  21.0
4  34  26.0  22.0
5  34  26.0  22.0
6  34  26.0  22.0
7  34  26.0  22.0

But I guess it would be just as natural to have NaNs on the remaining rows too:

    A     B     C
0  10  10.0  10.0
1  20  18.0  16.0
2  28  22.0  20.0
3  32  24.0  21.0
4  34  26.0  22.0
5  34  26.0   NaN
6  34   NaN   NaN
7  34   NaN   NaN

Here's a visual representation of the issue:

[Image: Before]

[Image: After]

I've come up with a cumbersome approach using a for loop: I remove the leading NaNs with df.dropna(), count the number of values removed (N), append the last available value N times, and build a new dataframe column by column. This turned out to be pretty slow for larger dataframes. I feel like this should already be built-in functionality of the omnipotent pandas library, but I haven't found anything so far. Does anyone have a suggestion for a less cumbersome way of doing this?

Complete code with a sample dataset:

import pandas as pd
import numpy as np

# sample dataframe
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                   'A': [10, 20, 28, 32, 34, 34, 34, 34],
                   'B': [np.nan, np.nan, 10, 18, 22, 24, 26, 26],
                   'C': [np.nan, np.nan, np.nan, 10, 16, 20, 21, 22]})
df = df.set_index('ID')

# container for dataframe
# to be built using a for loop
df_new=pd.DataFrame()

for col in df.columns:
    # drop missing values column by column
    ser = df[col]
    original_length = len(ser)
    ser_new = ser.dropna()

    # if leading values are removed for N rows,
    # append the last value N times for the last rows
    if len(ser_new) <= original_length:
        N = original_length - len(ser_new)
        ser_append = [ser.iloc[-1]] * N
        #ser_append = [np.nan] * N
        # Series.append was removed in pandas 2.0; pd.concat does the same job
        ser_new = pd.concat([ser_new, pd.Series(ser_append, dtype=ser.dtype)],
                            ignore_index=True)
    df_new[col] = ser_new

df_new

Answer 1

We can use shift to move each series up by its number of missing values:

d = df.isna().sum(axis=0).to_dict()  # number of missing values per column

for k,v in d.items():
    df[k] = df[k].shift(-v).ffill()


print(df)

   ID   A     B     C
0   1  10  10.0  10.0
1   2  20  18.0  16.0
2   3  28  22.0  20.0
3   4  32  24.0  21.0
4   5  34  26.0  22.0
5   6  34  26.0  22.0
6   7  34  26.0  22.0
7   8  34  26.0  22.0
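One caveat worth sketching: isna().sum() counts every NaN in a column, not only the leading ones, so a column with a gap in the middle would be shifted too far. The variant below is a minimal sketch, assuming you want to shift by the leading run only. The helper names count_leading_nans and shift_up_leading are hypothetical, not part of pandas, and the small dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd


def count_leading_nans(ser: pd.Series) -> int:
    """Count NaNs before the first valid value (hypothetical helper)."""
    # The running product stays 1 only while every value so far is NaN.
    return int(ser.isna().astype(int).cumprod().sum())


def shift_up_leading(df: pd.DataFrame) -> pd.DataFrame:
    """Shift each column up by its leading-NaN count, then forward fill."""
    return df.apply(lambda s: s.shift(-count_leading_nans(s))).ffill()


# Illustrative data: column B has one leading NaN and one interior NaN.
df = pd.DataFrame({'A': [10, 20, 28, 32],
                   'B': [np.nan, 10, np.nan, 18]})

result = shift_up_leading(df)
print(result)
```

With isna().sum() the column B would be shifted by 2 and keep a NaN in its first row; counting only the leading run shifts it by 1. Note that ffill still fills the interior gap, which may or may not be what you want.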

Answer 2

Here is a pure pandas solution. Use apply to shift each column up by its number of NaNs (all of them leading in this data), then forward fill with ffill:

df.apply(lambda x: x.shift(-x.isna().sum())).ffill()


    A     B     C
1  10  10.0  10.0
2  20  18.0  16.0
3  28  22.0  20.0
4  32  24.0  21.0
5  34  26.0  22.0
6  34  26.0  22.0
7  34  26.0  22.0
8  34  26.0  22.0
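The question also mentions a second acceptable result, with NaNs left on the trailing rows and a 0-based index. A minimal sketch of that variant, assuming the same sample data as in the question: drop the ffill call, and optionally call reset_index(drop=True) to replace the ID index.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                   'A': [10, 20, 28, 32, 34, 34, 34, 34],
                   'B': [np.nan, np.nan, 10, 18, 22, 24, 26, 26],
                   'C': [np.nan, np.nan, np.nan, 10, 16, 20, 21, 22]}).set_index('ID')

# Same shift as above; without ffill the vacated rows at the bottom stay NaN
shifted = df.apply(lambda x: x.shift(-x.isna().sum()))

# Optional: replace the ID index with 0..n-1, as in the question's expected output
shifted = shifted.reset_index(drop=True)
print(shifted)
```

Column B ends with two NaNs and column C with three, matching the number of leading NaNs each column started with.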

2 answers

Answer 1
2 2020-03-30 16:38:18

Answer 2
2 Accepted 2020-03-30 16:50:08
