
Find consecutive NaNs in pandas dataframe

I would like to find consecutive NaNs in my DataFrame columns, something like:

>>> df = pd.DataFrame([[np.nan, 2, np.nan],
...                    [3, 4, np.nan],
...                    [np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan]],
...                    columns=list('ABC'))
>>> df
     A    B   C 
0  NaN  2.0 NaN 
1  3.0  4.0 NaN 
2  NaN  NaN NaN 
3  NaN  3.0 NaN 

would give

     A    B   C 
0  1.0  NaN 4.0 
1  NaN  NaN 4.0 
2  2.0  1.0 4.0 
3  2.0  NaN 4.0 

Use:

a = df.isnull()
b = a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a)
print(b)
     A    B  C
0  1.0  NaN  4
1  NaN  NaN  4
2  2.0  1.0  4
3  2.0  NaN  4

Detail:

# assign a unique id to each run of consecutive values
print(a.ne(a.shift()).cumsum())
   A  B  C
0  1  1  1
1  2  1  1
2  3  2  1
3  3  3  1
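The run-id trick above can be checked on its own with a small boolean Series (a minimal sketch; the values are illustrative only):

```python
import pandas as pd

a = pd.Series([True, False, True, True, False])
# each change from the previous value starts a new run;
# cumsum turns those change points into consecutive run ids
run_id = a.ne(a.shift()).cumsum()
print(run_id.tolist())  # → [1, 2, 3, 3, 4]
```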

# count how many rows belong to each run id and map the count back
print(a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())))
   A  B  C
0  1  2  4
1  1  2  4
2  2  1  4
3  2  1  4
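The `value_counts` + `map` step can likewise be tried in isolation on run ids like those above (a sketch):

```python
import pandas as pd

run_id = pd.Series([1, 2, 3, 3, 4])
# value_counts gives each run's length; map writes that length
# back onto every row belonging to the run
run_len = run_id.map(run_id.value_counts())
print(run_len.tolist())  # → [1, 1, 2, 2, 1]
```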

# keep the counts only where the original value was NaN (mask by a)
print(a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a))
     A    B  C
0  1.0  NaN  4
1  NaN  NaN  4
2  2.0  1.0  4
3  2.0  NaN  4

Single-column alternative:

a = df['A'].isnull()
b = a.ne(a.shift()).cumsum()
c = b.map(b.value_counts()).where(a)

print(c)
0    1.0
1    NaN
2    2.0
3    2.0
Name: A, dtype: float64
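Putting the three steps together, a self-contained version of the single-column approach on the question's df (this just restates the snippet above in runnable form):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))

a = df['A'].isnull()                  # True where NaN
b = a.ne(a.shift()).cumsum()          # consecutive-run ids
c = b.map(b.value_counts()).where(a)  # run length, kept only on NaN rows
print(c.tolist())  # → [1.0, nan, 2.0, 2.0]
```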

IIUC... groupby + mask + isnull

df.apply(lambda x: x.groupby(x.isnull().diff().ne(0).cumsum()).transform(len).mask(~x.isnull()))
Out[751]: 
     A    B    C
0  1.0  NaN  4.0
1  NaN  NaN  4.0
2  2.0  1.0  4.0
3  2.0  NaN  4.0
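The same grouping idea also works with `transform('size')`, which avoids pushing `len` through a Python-level transform (a sketch on the question's df; `nan_run_lengths` is a name introduced here, not from the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))

def nan_run_lengths(s):
    mask = s.isnull()
    runs = mask.ne(mask.shift()).cumsum()  # consecutive-run ids
    # length of each run, kept only on the NaN rows
    return s.groupby(runs).transform('size').where(mask)

out = df.apply(nan_run_lengths)
print(out)
```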

For one column:

df.A.groupby(df.A.isnull().diff().ne(0).cumsum()).transform(len).mask(~df.A.isnull())
Out[756]: 
0    1.0
1    NaN
2    2.0
3    2.0
Name: A, dtype: float64

Not sure if it's very elegant, but here's how I did it:

def f(ds):
    ds = ds.isnull()
    splits = np.split(ds, np.where(ds == False)[0])
    counts = [np.sum(v) for v in splits]
    return pd.concat([pd.Series(split).replace({False: np.nan, True: count}) 
                      for split, count in zip(splits, counts)])

df.apply(lambda x: f(x))

Explanation:

# Boolean mask: True where the value is NaN
ds = ds.isnull()

# Split the array before each False (i.e. at each non-NaN value)
splits = np.split(ds, np.where(ds == False)[0])

# Now just count the number of True values
counts = [np.sum(v) for v in splits]

# Concatenate series that contains the requested values
pd.concat([pd.Series(split).replace({False: np.nan, True: count}) 
           for split, count in zip(splits, counts)])
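For a quick end-to-end check, the function can be run on the question's df (a sketch; `~ds` is used in place of `ds == False` but is equivalent):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))

def f(ds):
    ds = ds.isnull()
    # split before every non-NaN position (False in the mask)
    splits = np.split(ds, np.where(~ds)[0])
    # number of NaNs (True values) in each chunk
    counts = [np.sum(v) for v in splits]
    return pd.concat([pd.Series(split).replace({False: np.nan, True: count})
                      for split, count in zip(splits, counts)])

out = df.apply(f)
print(out)
```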
