
Find consecutive NaNs in pandas dataframe

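I want to find the runs of consecutive NaNs in each column of my dataframe and label every NaN with the length of its run. For example,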

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[np.nan, 2, np.nan],
...                    [3, 4, np.nan],
...                    [np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan]],
...                   columns=list('ABC'))
>>> df
     A    B   C 
0  NaN  2.0 NaN 
1  3.0  4.0 NaN 
2  NaN  NaN NaN 
3  NaN  3.0 NaN 

should give

>>> df
     A    B   C 
0  1.0  NaN 4.0 
1  NaN  NaN 4.0 
2  2.0  1.0 4.0 
3  2.0  NaN 4.0 

Use:

a = df.isnull()
b = a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a)
print(b)
     A    B  C
0  1.0  NaN  4
1  NaN  NaN  4
2  2.0  1.0  4
3  2.0  NaN  4

Details:

# label each run of consecutive equal values in the mask
print(a.ne(a.shift()).cumsum())
   A  B  C
0  1  1  1
1  2  1  1
2  3  2  1
3  3  3  1

# count the rows sharing each label, per column, and map the counts back
print(a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())))
   A  B  C
0  1  2  4
1  1  2  4
2  2  1  4
3  2  1  4

# keep the counts only where the original value was NaN (mask a)
print(a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a))
     A    B  C
0  1.0  NaN  4
1  NaN  NaN  4
2  2.0  1.0  4
3  2.0  NaN  4
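
The three steps above can be wrapped into one small function. This is only a minimal sketch: the name `nan_run_lengths` is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd


def nan_run_lengths(df):
    """For every NaN cell, return the length of the NaN run it belongs to.

    Hypothetical helper name; cells that were not NaN come back as NaN.
    """
    is_nan = df.isnull()
    # A new run starts wherever the mask differs from the row above it.
    run_id = is_nan.ne(is_nan.shift()).cumsum()
    # Replace each run label by how often that label occurs in its column.
    run_len = run_id.apply(lambda col: col.map(col.value_counts()))
    # Keep the lengths only where the original cell was NaN.
    return run_len.where(is_nan)


df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))
result = nan_run_lengths(df)
print(result)
```

Naming the intermediate frames makes it easier to print `run_id` or `run_len` when the result looks wrong.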

Alternative for a single column:

a = df['A'].isnull()
b = a.ne(a.shift()).cumsum()
c = b.map(b.value_counts()).where(a)

print(c)
0    1.0
1    NaN
2    2.0
3    2.0
Name: A, dtype: float64

IIUC, this can be done with groupby + mask + isnull:

df.apply(lambda x: x.groupby(x.isnull().diff().ne(0).cumsum()).transform(len).mask(~x.isnull()))
Out[751]: 
     A    B    C
0  1.0  NaN  4.0
1  NaN  NaN  4.0
2  2.0  1.0  4.0
3  2.0  NaN  4.0

For a single column:

df.A.groupby(df.A.isnull().diff().ne(0).cumsum()).transform(len).mask(~df.A.isnull())
Out[756]: 
0    1.0
1    NaN
2    2.0
3    2.0
Name: A, dtype: float64
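
To see what the group key is doing, it may help to split the one-liner into named steps. The sketch below is a variation, not the exact code above: it builds the run labels with `ne(shift())` instead of `diff().ne(0)` and uses `transform('size')` instead of `transform(len)`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, np.nan, np.nan], name='A')

is_nan = s.isnull()
# Run labels: the counter goes up whenever the NaN mask changes between rows.
run_id = is_nan.ne(is_nan.shift()).cumsum()
# Size of each run, broadcast back onto the original index.
run_len = is_nan.groupby(run_id).transform('size')
# Blank out the lengths that belong to non-NaN runs.
out = run_len.mask(~is_nan)
print(out)
```

Grouping on `run_id` rather than on the mask itself is what keeps separate NaN runs apart. Grouping on `is_nan` directly would merge every NaN in the column into one group.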

Not sure this is very elegant, but this is how I did it:

import numpy as np
import pandas as pd

def f(ds):
    ds = ds.isnull()
    splits = np.split(ds, np.where(ds == False)[0])
    counts = [np.sum(v) for v in splits]
    return pd.concat([pd.Series(split).replace({False: np.nan, True: count})
                      for split, count in zip(splits, counts)])

# apply column by column
df.apply(f)

Explanation:

# Binarize the array: True where the value is NaN
ds = ds.isnull()

# Split the array where we have False (former non-NaN values)
splits = np.split(ds, np.where(ds == False)[0])

# Now just count the number of True values
counts = [np.sum(v) for v in splits]

# Concatenate series that contains the requested values
pd.concat([pd.Series(split).replace({False: np.nan, True: count}) 
           for split, count in zip(splits, counts)])
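
The same split-and-count idea can also be written on a plain NumPy boolean array, so that it does not depend on how `np.split` handles a pandas Series in a given version. This is a sketch under that assumption, and `nan_runs_by_split` is a made-up name, not a library function.

```python
import numpy as np
import pandas as pd


def nan_runs_by_split(col):
    """Hypothetical helper: NaN-run lengths for one column via np.split."""
    mask = col.isnull().to_numpy()
    # Cut the mask just before every non-NaN entry, so each piece holds at
    # most one leading non-NaN entry followed by a single run of NaNs.
    pieces = np.split(mask, np.flatnonzero(~mask))
    # Every True in a piece gets the number of True values in that piece;
    # every False becomes NaN.
    filled = [np.where(p, p.sum(), np.nan) for p in pieces]
    return pd.Series(np.concatenate(filled), index=col.index, name=col.name)


df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))
result = df.apply(nan_runs_by_split)
print(result)
```

Because the pieces are concatenated in their original order, the values can be put straight back onto the column's index, which lets `df.apply` reassemble a DataFrame.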
