Pandas: How to drop leading missing values for all columns in a pandas dataframe?
With a pandas dataframe of the form:
A B C
ID
1 10 NaN NaN
2 20 NaN NaN
3 28 10.0 NaN
4 32 18.0 10.0
5 34 22.0 16.0
6 34 24.0 20.0
7 34 26.0 21.0
8 34 26.0 22.0
How can I remove a varying number of initial missing values? Initially, I'd like to forward fill the last values of the "new" columns so I'll end up with this:
A B C
0 10 10.0 10.0
1 20 18.0 16.0
2 28 22.0 20.0
3 32 24.0 21.0
4 34 26.0 22.0
5 34 26.0 22.0
6 34 26.0 22.0
7 34 26.0 22.0
But I guess it would be just as natural to have NaNs on the remaining rows too:
A B C
0 10 10.0 10.0
1 20 18.0 16.0
2 28 22.0 20.0
3 32 24.0 21.0
4 34 26.0 22.0
5 34 26.0 NaN
6 34 NaN NaN
7 34 NaN NaN
I've come up with a cumbersome approach using a for loop where I remove the leading NaNs using df.dropna(), count the number of values I've removed (N), append the last available number N times, and build a new dataframe column by column. But this turned out to be pretty slow for larger dataframes. I feel like this is something that's already built into the omnipotent pandas library, but I haven't found anything so far. Does anyone have a suggestion for a less cumbersome way of doing this?
Complete code with a sample dataset:
import pandas as pd
import numpy as np
# sample dataframe
df = pd.DataFrame({'ID': [1,2,3,4,5,6,7,8],
                   'A': [10,20,28,32,34,34,34,34],
                   'B': [np.nan, np.nan, 10,18,22,24,26,26],
                   'C': [np.nan, np.nan, np.nan,10,16,20,21,22]})
df = df.set_index('ID')
# container for dataframe
# to be built using a for loop
df_new = pd.DataFrame()
for col in df.columns:
    # drop missing values column by column and restart the index at 0
    ser = df[col]
    original_length = len(ser)
    ser_new = ser.dropna().reset_index(drop=True)
    # if leading values were removed for N rows,
    # append the last value N times for the last rows
    N = original_length - len(ser_new)
    if N > 0:
        ser_append = pd.Series([ser.iloc[-1]] * N)
        # ser_append = pd.Series([np.nan] * N)
        # Series.append was removed in pandas 2.0, so use pd.concat instead
        ser_new = pd.concat([ser_new, ser_append], ignore_index=True)
    df_new[col] = ser_new
df_new
We can make use of shift and move each series up by the number of missing values:
d = df.isna().sum(axis=0).to_dict()  # calculate the number of missing rows per column
for k, v in d.items():
    df[k] = df[k].shift(-v).ffill()
print(df)
ID A B C
0 1 10 10.0 10.0
1 2 20 18.0 16.0
2 3 28 22.0 20.0
3 4 32 24.0 21.0
4 5 34 26.0 22.0
5 6 34 26.0 22.0
6 7 34 26.0 22.0
7 8 34 26.0 22.0
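One caveat with the approach above: isna().sum() counts every NaN in a column, so it equals the number of leading NaNs only when the column has no gaps further down. The following is a minimal sketch, not part of the original answer, that counts just the leading run. The helper name shift_up_leading and the small sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def shift_up_leading(frame, fill=True):
    """Shift each column up by its number of leading NaNs (hypothetical helper)."""
    # The cumulative minimum of the 0/1 NaN mask stays 1 only while
    # every value seen so far in the column is NaN.
    leading = frame.isna().astype(int).cummin().sum()
    out = frame.copy()
    for col, n in leading.items():
        out[col] = frame[col].shift(-int(n))
    # Note: ffill() also fills any interior gaps, not just the vacated tail.
    return out.ffill() if fill else out

# Column B has one leading NaN and one interior NaN.
df = pd.DataFrame({'A': [10, 20, 28, 32],
                   'B': [np.nan, 10, np.nan, 18]})
result = shift_up_leading(df)
print(result)
```

With isna().sum() column B would be shifted by 2 and the value 10 would be lost; counting only the leading run shifts it by 1.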
Here is a pure pandas solution. Use apply to shift the values up depending on the number of leading NaNs, then use ffill:
df.apply(lambda x: x.shift(-x.isna().sum())).ffill()
A B C
1 10 10.0 10.0
2 20 18.0 16.0
3 28 22.0 20.0
4 32 24.0 21.0
5 34 26.0 22.0
6 34 26.0 22.0
7 34 26.0 22.0
8 34 26.0 22.0
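The question also mentions a second acceptable result, with NaNs left on the remaining rows. A sketch of that variant, assuming the same sample data as in the question, is simply the same shift without the ffill step. The reset_index call is optional and only mirrors the 0-based index shown in the question's tables.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                   'A': [10, 20, 28, 32, 34, 34, 34, 34],
                   'B': [np.nan, np.nan, 10, 18, 22, 24, 26, 26],
                   'C': [np.nan, np.nan, np.nan, 10, 16, 20, 21, 22]})
df = df.set_index('ID')

# Same shift as above, but without ffill(), so the vacated rows stay NaN.
shifted = df.apply(lambda x: x.shift(-x.isna().sum()))

# Optional: swap the ID index for a 0-based one.
shifted = shifted.reset_index(drop=True)
print(shifted)
```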