In each row of pandas, starting at the first non-NaN, a window of X values remains untouched while all other values are NaN
Citizens of StackOverflow,
I am currently running iterations over a dataframe that can be millions of rows long. In each row of my dataframe I have leading NaNs (desired), followed by values. I want to keep only X values in each row, followed by NaNs after that. Effectively I want a window of only X values, beginning with the first non-NaN, with all other positions in the row set to NaN.
My solution is very slow. Additionally, I didn't find similar questions to be sufficiently helpful (most concerned just the first/last NaN).
An example where the window size is 3:
import pandas as pd
import numpy as np
x = 3
data = {'2018Q3': [0, np.nan, np.nan, np.nan, np.nan],
'2018Q4': [1, np.nan, np.nan, np.nan, 10],
'2019Q1': [2, 3, np.nan, np.nan, 12],
'2019Q2': [3, 4, np.nan, 8, 14],
'2019Q3': [4, 5, np.nan, 9, 22]}
df = pd.DataFrame.from_dict(data)
print(df)
2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0 0.0 1.0 2.0 3.0 4.0
1 NaN NaN 3.0 4.0 5.0
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 8.0 9.0
4 NaN 10.0 12.0 14.0 22.0
Results should look like this:
2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0 0.0 1.0 2.0 NaN NaN
1 NaN NaN 3.0 4.0 5.0
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 8.0 9.0
4 NaN 10.0 12.0 14.0 NaN
MY SOLUTION:
def cut_excess_forecast(num_x, dataf):
    Total_Col = len(dataf.columns.values)  # total columns
    rows = []  # changed rows are collected here (DataFrame.append was removed in pandas 2.0)
    for index, row in dataf.iterrows():
        nas = row.isnull().sum()  # number of nulls
        good_data = nas + num_x  # gives number of columns that should be untouched
        if good_data >= Total_Col:  # if number of columns to not be touched >= available columns, pass
            pass  # all data available is needed
        else:
            cutoff = Total_Col - good_data
            row.iloc[-cutoff:] = np.nan  # change to NaN excess columns in this row
        rows.append(row.copy())  # collect changed row
    df_NEW = pd.DataFrame(rows)
    df_NEW.index = dataf.index  # move over original index to the new dataframe
    return df_NEW.copy()
df2 = cut_excess_forecast(x, df)
print(df2)
Sorting is allowed, so long as the index is untouched. Cheers and thanks in advance.
Try:
df.where(df.notna().cumsum(1)<4)
Output:
2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0 0.0 1.0 2.0 NaN NaN
1 NaN NaN 3.0 4.0 5.0
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN 8.0 9.0
4 NaN 10.0 12.0 14.0 NaN
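The one-liner hard-codes 4, which is the window size of 3 plus one. A minimal sketch of the same idea with the window size as a parameter could look like this; the function name `keep_first_window` and its `window` argument are made up for illustration and are not part of pandas:

```python
import numpy as np
import pandas as pd


def keep_first_window(frame: pd.DataFrame, window: int) -> pd.DataFrame:
    """Keep the first `window` non-NaN values of each row; set the rest to NaN.

    Hypothetical helper: same technique as df.where(df.notna().cumsum(1) < 4),
    with the threshold written as `<= window` instead of `< window + 1`.
    """
    # Running count of non-NaN values along each row (axis=1).
    counts = frame.notna().cumsum(axis=1)
    # Cells whose running count is still within the window are kept.
    return frame.where(counts <= window)


df = pd.DataFrame({
    '2018Q3': [0, np.nan, np.nan, np.nan, np.nan],
    '2018Q4': [1, np.nan, np.nan, np.nan, 10],
    '2019Q1': [2, 3, np.nan, np.nan, 12],
    '2019Q2': [3, 4, np.nan, 8, 14],
    '2019Q3': [4, 5, np.nan, 9, 22],
})

result = keep_first_window(df, 3)
print(result)
```

Because everything is vectorized, no Python-level loop over rows is needed, and the original index is left as it was.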
Explanation:
df.notna() masks the NaN values with False and non-NaN values with True.
2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0 True True True True True
1 False False True True True
2 False False False False False
3 False False False True True
4 False True True True True
Chaining cumsum(1) then counts the non-NaN values along each row, from left to right.
2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0 1 2 3 4 5
1 0 0 1 2 3
2 0 0 0 0 0
3 0 0 0 1 2
4 0 1 2 3 4
Then we compare with <4, so that cells where the count has reached 4 or more (that is, anything past the window of 3 values) become False.
2018Q3 2018Q4 2019Q1 2019Q2 2019Q3
0 True True True False False
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True False
Finally, wrap it in df.where, which keeps the cells where the mask is True and replaces the others with NaN.
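One thing to be aware of: the cumsum approach keeps the first X non-NaN values, which matches a window of X positions only when there are no NaNs after the first value in a row (as in the question's data). If the rows can contain gaps and the window should be counted in positions instead, a NumPy-based variant along the following lines might work. This is a different technique from the one above, and `keep_positional_window` is a hypothetical name, so treat it as a sketch:

```python
import numpy as np
import pandas as pd


def keep_positional_window(frame: pd.DataFrame, window: int) -> pd.DataFrame:
    """Keep `window` consecutive columns starting at each row's first non-NaN.

    Hypothetical helper; assumes all columns are numeric.
    """
    valid = frame.notna().to_numpy()
    # Position of the first non-NaN per row (0 for rows that are all NaN,
    # which is harmless because those rows contain nothing to keep).
    first = valid.argmax(axis=1)
    positions = np.arange(frame.shape[1])
    # True for columns in [first, first + window) of each row.
    keep = (positions >= first[:, None]) & (positions < first[:, None] + window)
    return frame.where(keep)


# A row with a gap after its first value, to show where the two approaches differ.
gapped = pd.DataFrame([[np.nan, 1.0, np.nan, 2.0, 3.0]], columns=list("abcde"))

by_count = gapped.where(gapped.notna().cumsum(axis=1) <= 3)
by_position = keep_positional_window(gapped, 3)
print(by_count)      # keeps 1.0, 2.0 and 3.0 (first three non-NaN values)
print(by_position)   # keeps columns b..d only, so 3.0 becomes NaN
```

For data shaped like the example in the question, both versions give the same result, so the shorter cumsum form is probably the one to prefer there.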