In each row of a pandas DataFrame, keep a window of X values starting at the first non-NaN and set all other values to NaN

Question

Citizens of StackOverflow,

I am currently iterating over a DataFrame that can be millions of rows long. Each row has leading NaNs (which are desired), followed by values. I want to keep only X values in each row, with NaNs after them. Effectively, I want a window of only X values beginning at the first non-NaN; every other position in the row should be NaN.

My solution is very slow. Additionally, I didn't find similar questions sufficiently helpful (most were concerned only with the first/last NaN).

An example where the window size is 3:

import pandas as pd
import numpy as np

x = 3

data = {'2018Q3': [0, np.nan, np.nan, np.nan, np.nan],
        '2018Q4': [1, np.nan, np.nan, np.nan, 10],
        '2019Q1': [2, 3, np.nan, np.nan, 12],
        '2019Q2': [3, 4, np.nan, 8, 14],
        '2019Q3': [4, 5, np.nan, 9, 22]}

df = pd.DataFrame.from_dict(data) 
print(df)

   2018Q3  2018Q4  2019Q1  2019Q2  2019Q3
0     0.0     1.0     2.0     3.0     4.0
1     NaN     NaN     3.0     4.0     5.0
2     NaN     NaN     NaN     NaN     NaN
3     NaN     NaN     NaN     8.0     9.0
4     NaN    10.0    12.0    14.0    22.0

Results should look like this:

   2018Q3  2018Q4  2019Q1  2019Q2  2019Q3
0     0.0     1.0     2.0     NaN     NaN
1     NaN     NaN     3.0     4.0     5.0
2     NaN     NaN     NaN     NaN     NaN
3     NaN     NaN     NaN     8.0     9.0
4     NaN    10.0    12.0    14.0     NaN

My solution:

def cut_excess_forecast(num_x, dataf):
    total_col = len(dataf.columns)  # total number of columns
    rows = []
    for index, row in dataf.iterrows():
        row = row.copy()  # work on a copy so the input frame is not modified
        nas = row.isnull().sum()  # number of nulls in this row
        good_data = nas + num_x  # number of columns that should be left untouched
        if good_data < total_col:  # otherwise all available data is needed
            cutoff = total_col - good_data
            row.iloc[-cutoff:] = np.nan  # set the excess trailing columns to NaN
        rows.append(row)

    df_new = pd.DataFrame(rows)  # build the result once from the collected rows
    df_new.index = dataf.index  # carry the original index over to the new dataframe
    return df_new

df2 = cut_excess_forecast(x, df)
print(df2)

Sorting is allowed, so long as the index is untouched. Cheers and thanks in advance.

Answer 1

Try:

df.where(df.notna().cumsum(1)<4)

Output:

   2018Q3  2018Q4  2019Q1  2019Q2  2019Q3
0     0.0     1.0     2.0     NaN     NaN
1     NaN     NaN     3.0     4.0     5.0
2     NaN     NaN     NaN     NaN     NaN
3     NaN     NaN     NaN     8.0     9.0
4     NaN    10.0    12.0    14.0     NaN
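
The threshold 4 in the one-liner is the window size 3 plus one. A minimal sketch of the same idea with the window size as a parameter follows; the function name `keep_first_values` is hypothetical, and it assumes, as in the question, that each row has no NaNs inside its run of values, since the count is of non-NaN values rather than of positions.

```python
import numpy as np
import pandas as pd


def keep_first_values(frame, x):
    """Keep the first x non-NaN values of each row; set the rest to NaN."""
    # Running count of non-NaN cells along each row, left to right.
    running_count = frame.notna().cumsum(axis=1)
    # Keep cells whose running count is still within the window.
    return frame.where(running_count <= x)


data = {'2018Q3': [0, np.nan, np.nan, np.nan, np.nan],
        '2018Q4': [1, np.nan, np.nan, np.nan, 10],
        '2019Q1': [2, 3, np.nan, np.nan, 12],
        '2019Q2': [3, 4, np.nan, 8, 14],
        '2019Q3': [4, 5, np.nan, 9, 22]}
df = pd.DataFrame(data)

result = keep_first_values(df, 3)
print(result)
```

Because `where` works cell by cell, the index and column order of the input are left as they were.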

Explanation:

df.notna() marks NaN values as False and non-NaN values as True.

   2018Q3  2018Q4  2019Q1  2019Q2  2019Q3
0    True    True    True    True    True
1   False   False    True    True    True
2   False   False   False   False   False
3   False   False   False    True    True
4   False    True    True    True    True

Chaining that with cumsum(1) counts the non-NaN values along each row from left to right.

   2018Q3  2018Q4  2019Q1  2019Q2  2019Q3
0       1       2       3       4       5
1       0       0       1       2       3
2       0       0       0       0       0
3       0       0       0       1       2
4       0       1       2       3       4

Then we compare with <4, which gives False wherever the running count exceeds the window size of 3 (that is, reaches 4 or more):

   2018Q3  2018Q4  2019Q1  2019Q2  2019Q3
0    True    True    True   False   False
1    True    True    True    True    True
2    True    True    True    True    True
3    True    True    True    True    True
4    True    True    True    True   False

Finally, pass that mask to df.where, which keeps the cells where the mask is True and replaces the others with NaN.
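
For frames with millions of rows, the same three steps can also be written directly against the underlying NumPy array. This is only a sketch under stated assumptions: the function name `keep_window_numpy` is hypothetical, all columns are assumed to be numeric, and whether it is actually faster than the pandas one-liner would need to be measured on real data.

```python
import numpy as np
import pandas as pd


def keep_window_numpy(frame, x):
    """NumPy variant: keep the first x non-NaN values per row."""
    values = frame.to_numpy(dtype=float)            # 2-D float array of the data
    counts = np.cumsum(~np.isnan(values), axis=1)   # running non-NaN count per row
    trimmed = np.where(counts <= x, values, np.nan)  # blank cells past the window
    # Rebuild the frame with the original index and columns.
    return pd.DataFrame(trimmed, index=frame.index, columns=frame.columns)


df = pd.DataFrame({'2018Q3': [0, np.nan, np.nan, np.nan, np.nan],
                   '2018Q4': [1, np.nan, np.nan, np.nan, 10],
                   '2019Q1': [2, 3, np.nan, np.nan, 12],
                   '2019Q2': [3, 4, np.nan, 8, 14],
                   '2019Q3': [4, 5, np.nan, 9, 22]})

result_np = keep_window_numpy(df, 3)
print(result_np)
```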

Answer 1 (score 3, accepted, 2021-01-21 21:51:03)
