
How to group consecutive NaN values from a Pandas Series in a set of slices?

I want to merge consecutive NaN values into slices. Is there a simple way of doing this with numpy or pandas?

import numpy as np
import pandas as pd

l = [
    (996, np.nan), (997, np.nan), (998, np.nan),
    (999, -47.3), (1000, -72.5), (1100, -97.7),
    (1200, np.nan), (1201, np.nan), (1205, -97.8),
    (1300, np.nan), (1302, np.nan), (1305, -97.9),
    (1400, np.nan), (1405, -97.10), (1408, np.nan)
]
l = pd.Series(dict(l))

Expected result:

[
    (slice(996, 999, None), array([nan, nan, nan])),
    (999, -47.3),
    (1000, -72.5),
    (1100, -97.7),
    (slice(1200, 1202, None), array([nan, nan])),
    (1205, -97.8),
    (slice(1300, 1301, None), array([nan])),
    (slice(1302, 1303, None), array([nan])),
    (1305, -97.9),
    (slice(1400, 1401, None), array([nan])),
    (1405, -97.1),
    (slice(1408, 1409, None), array([nan]))
]

A numpy array with two dimensions would be OK as well, rather than a list of tuples.

Update 2019/05/31: I have just realised that if I use a plain dictionary instead of a Pandas Series, the algorithm is much more efficient.
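For what it's worth, a dict-based variant of the same idea might look like the sketch below (`group_nan_runs` is a hypothetical name, not code from this page): walk the sorted (index, value) pairs once, collect runs of consecutive-index NaNs into slices, and emit everything else as bare pairs.

```python
import numpy as np

def group_nan_runs(d):
    """Group consecutive-index NaN values of dict `d` into slices."""
    items = sorted(d.items())
    result = []
    run = []  # indices of the current NaN run

    def flush():
        # Emit the pending NaN run, if any, as a half-open slice.
        if run:
            result.append((slice(run[0], run[-1] + 1),
                           np.full(run[-1] + 1 - run[0], np.nan)))
            run.clear()

    for idx, val in items:
        if isinstance(val, float) and np.isnan(val):
            if run and idx != run[-1] + 1:
                flush()          # index gap: close the previous run
            run.append(idx)
        else:
            flush()
            result.append((idx, val))
    flush()
    return result
```

On the question's data this yields the same twelve elements as the expected result, without paying the per-element indexing cost of a Series.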

What you want is full of corner cases: nan equality, the first element of each pair being either a slice or a single value, the second being either a np.array or a single value.

For such complex requirements, I would just rely on plain, non-vectorized Python:

def trans(ser):
    # Build one output element for the half-open run [last, cur) holding val.
    def build(last, cur, val):
        if cur == last + 1:
            # Single-element run: a NaN still becomes a slice,
            # a real value is emitted as a bare (index, value) pair.
            if np.isnan(val):
                return (slice(last, cur), np.array([np.nan]))
            else:
                return (last, val)
        else:
            return (slice(last, cur), np.array([val] * (cur - last)))

    last = ser.iloc[0]
    old = last_index = ser.index[0]
    resul = []
    for i in ser.index[1:]:
        val = ser[i]
        # Close the current run when the value changes (treating NaN == NaN)
        # or when the index is not consecutive.
        if ((val != last) and not (np.isnan(val) and np.isnan(last))) \
           or i != old + 1:
            resul.append(build(last_index, old + 1, last))
            last_index = i
            last = val
        old = i
    resul.append(build(last_index, old + 1, last))
    return resul

It gives something close to the expected result:

[(slice(996, 999, None), array([nan, nan, nan])),
 (999, -47.3),
 (1000, -72.5),
 (1100, -97.7),
 (slice(1200, 1202, None), array([nan, nan])),
 (1205, -97.8),
 (slice(1300, 1301, None), array([nan])),
 (slice(1302, 1303, None), array([nan])),
 (1305, -97.9),
 (slice(1400, 1401, None), array([nan])),
 (1405, -97.1),
 (slice(1408, 1409, None), array([nan]))]

Grouping by the cumulative sum of notnull is a good idea, but we need to separate out the first non-null value of each sub-series, so we group by the pair (cumsum, notnull):
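To see why the cumulative sum of notnull works as a group key, a small sketch on a toy series (not the question's data): the cumsum stays flat across a NaN run and increments on each non-null value, so the pair (cumsum, notnull) uniquely labels every NaN run and every individual non-null value.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, -47.3, np.nan, -97.8])

notnull = s.notnull()
# cumsum: [0, 0, 1, 1, 2] — the two leading NaNs share (0, False),
# while the later lone NaN gets the distinct label (1, False).
print(pd.DataFrame({'val': s,
                    'notnull': notnull,
                    'cumsum': notnull.cumsum()}))
```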

# convert the series to a frame,
# don't know why a bare Series doesn't work
df = l.to_frame(name='val')

df['notnull'] = df['val'].notnull()
g = df.groupby([df['notnull'].cumsum(), 'notnull']).val

[(v.index, v.values) for i, v in g]

Out:

[(Int64Index([996, 997, 998], dtype='int64'), array([nan, nan, nan])),
 (Int64Index([999], dtype='int64'), array([-47.3])),
 (Int64Index([1000], dtype='int64'), array([-72.5])),
 (Int64Index([1200, 1201], dtype='int64'), array([nan, nan])),
 (Int64Index([1100], dtype='int64'), array([-97.7])),
 (Int64Index([1300, 1302], dtype='int64'), array([nan, nan])),
 (Int64Index([1205], dtype='int64'), array([-97.8])),
 (Int64Index([1400], dtype='int64'), array([nan])),
 (Int64Index([1305], dtype='int64'), array([-97.9])),
 (Int64Index([1408], dtype='int64'), array([nan])),
 (Int64Index([1405], dtype='int64'), array([-97.1]))]

Edit: taking the consecutive indices into consideration, and updated to produce slices:

# convert a group to a slice or a bare (index, value) pair
def get_slice(x):
    idx_min, idx_max = x.index.min(), x.index.max()

    if len(x) > 1:
        return (slice(idx_min, idx_max + 1), x.values)
    elif x.isna().any():
        # a lone NaN still becomes a one-element slice
        return (slice(idx_min, idx_min + 1), x.values)
    else:
        return (idx_min, x[idx_min])

df['notnull'] = df['val'].notnull()

# non-continuous indices
df['sep'] = (df.index != df.index.to_series().shift() + 1).cumsum()

g = df.groupby(['sep', df['notnull'].cumsum(), 'notnull']).val

g.apply(get_slice).values.tolist()

gives:

[(slice(996, 999, None), array([nan, nan, nan])),
 (999, -47.3),
 (1000, -72.5),
 (1100, -97.7),
 (slice(1200, 1202, None), array([nan, nan])),
 (1205, -97.8),
 (slice(1300, 1301, None), array([nan])),
 (slice(1302, 1303, None), array([nan])),
 (1305, -97.9),
 (slice(1400, 1401, None), array([nan])),
 (1405, -97.1),
 (slice(1408, 1409, None), array([nan]))]
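The `sep` trick above can be isolated in a minimal sketch (toy index, not the question's data): the comparison is True wherever an index value does not follow its predecessor by exactly 1 (the first element compares against NaN, which is never equal, so it also starts a new block), and cumsum turns those break points into one label per consecutive block.

```python
import pandas as pd

idx = pd.Series([996, 997, 998, 1200, 1201, 1205])
# shift()+1 gives the "expected next index"; any mismatch marks a break.
sep = (idx != idx.shift() + 1).cumsum()
print(sep.tolist())   # [1, 1, 1, 2, 2, 3]
```

Adding `sep` as an extra groupby key is what splits NaNs at non-consecutive indices, such as 1300 and 1302, into separate groups.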
