[英]How to replace repeated NaNs with a different value from lone NaNs from Pandas data frame
I have several time series arranged in a data frame, similar to the one below:
category value time_idx
0 810 0.118794 0
1 830 0.552947 0
2 1120 0.133193 0
3 1370 0.840183 0
4 810 0.129385 1
... ... ... ...
6095 1370 0.157391 1523
6096 810 0.141377 1524
6097 830 0.212254 1524
6098 1120 0.069970 1524
6099 1370 0.134947 1524
Some values are NaN. What I would like is to replace any NaN values that are NOT repeated with 0, as I am assuming the value is 0 for that category at that time. However, any time that every single category has a value of NaN at that same time (i.e. at the same time_idx), I want to replace every value with -1.
Just replacing the NaNs with a value is of course trivial in Pandas, but the added complexity of specifically replacing NaNs that are NaN for every category at a given time has stumped me. I know I can just loop through the time indices, but my actual datasets will have over 900 categories, so I would like to find a more efficient, Pandas-esque method.
The only thing I could think of was a list comprehension, which I don't think is necessarily more efficient than an explicit loop anyway, and I couldn't come up with one that worked properly.
I know that I can replace all NaNs like so:
data["value"] = data["value"].replace(np.nan, 0)
but I'm not sure how to implement this in my case, where I only want to replace lone NaNs with 0. This is the loop I have currently:
num_channels = data["category"].nunique()
nan_vals = data[lambda x: np.isnan(x.value)]
nan_times = nan_vals["time_idx"]
for time in nan_times:
    if nan_vals[lambda x: x.time_idx == time]["category"].nunique() < num_channels:
        # Set 0 for every channel that has NaN at time t
        index = nan_vals[lambda x: x.time_idx == time].index
        data.loc[index, "value"] = data.loc[index, "value"].replace(np.nan, 0)
    else:
        index = nan_vals[lambda x: x.time_idx == time].index
        data.loc[index, "value"] = data[lambda x: x.time_idx == time]["value"].replace(np.nan, -1)
Any ideas are appreciated.
Here is an example. Given the following data frame:
category value time_idx
0 810 NaN 0
1 830 NaN 0
2 1120 NaN 0
3 1370 NaN 0
4 810 0.129385 1
5 830 NaN 1
6 1120 0.144378 1
7 1370 NaN 1
8 810 0.124334 2
9 830 0.487274 2
10 1120 0.119153 2
11 1370 0.871687 2
I would like this output:
category value time_idx
0 810 -1.000000 0
1 830 -1.000000 0
2 1120 -1.000000 0
3 1370 -1.000000 0
4 810 0.129385 1
5 830 0.000000 1
6 1120 0.144378 1
7 1370 0.000000 1
8 810 0.124334 2
9 830 0.487274 2
10 1120 0.119153 2
11 1370 0.871687 2
In this example, at time = 0 every category's value was NaN, so they would all be replaced with -1. At time = 1, there were non-NaN values, so any NaN values present (categories 830 and 1370) would be replaced with 0.
You can find those time_idx where all entries are NaN using groupby and then group.isna().all(). You can use that mask to fill those NaNs with -1. Afterwards, fill all other NaNs with 0 using fillna.
all_nas = df.groupby("time_idx").value.apply(lambda group: group.isna().all())
df = df.set_index("time_idx")
df.loc[all_nas, "value"] = -1
df = df.reset_index().fillna(0)
print(df)
# time_idx category value
# 0 0 810 -1.000000
# 1 0 830 -1.000000
# 2 0 1120 -1.000000
# 3 0 1370 -1.000000
# 4 1 810 0.129385
# 5 1 830 0.000000
# 6 1 1120 0.144378
# 7 1 1370 0.000000
# 8 2 810 0.124334
# 9 2 830 0.487274
# 10 2 1120 0.119153
# 11 2 1370 0.871687
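The same mask can also be broadcast directly back onto the rows with transform, avoiding the set_index/reset_index round trip. A sketch, reusing the example frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": [810, 830, 1120, 1370] * 3,
    "value": [np.nan, np.nan, np.nan, np.nan,
              0.129385, np.nan, 0.144378, np.nan,
              0.124334, 0.487274, 0.119153, 0.871687],
    "time_idx": [0] * 4 + [1] * 4 + [2] * 4,
})

# Per row: is every value in this row's time_idx group NaN?
all_nan = df["value"].isna().groupby(df["time_idx"]).transform("all")

df.loc[all_nan, "value"] = -1        # whole group NaN -> -1
df["value"] = df["value"].fillna(0)  # remaining lone NaNs -> 0
```

Because transform returns a boolean Series aligned with the original index, it can be used with df.loc directly, with no index manipulation needed.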
You can group by time_idx and iterate over the groups. Then, in each group, count the number of NaN values in the value column. Depending on that count, update the value column.
import pandas as pd

df = pd.DataFrame(
    {
        'category': [810, 830, 1120, 810, 830, 1120, 810, 830, 1120],
        'value': [None, None, None, 1, 2, None, None, None, 4],
        'time_idx': [0, 0, 0, 1, 1, 1, 2, 2, 2],
    }
)
print(df, end='\n\n')

for name, group in df.copy().groupby('time_idx'):
    num_nans = group['value'].isnull().sum()
    mask = (df['time_idx'] == name) & df['value'].isna()
    if num_nans == len(group):
        df.loc[mask, 'value'] = -1
    else:
        df.loc[mask, 'value'] = 0
print(df)
Output
category value time_idx
0 810 NaN 0
1 830 NaN 0
2 1120 NaN 0
3 810 1.0 1
4 830 2.0 1
5 1120 NaN 1
6 810 NaN 2
7 830 NaN 2
8 1120 4.0 2
category value time_idx
0 810 -1.0 0
1 830 -1.0 0
2 1120 -1.0 0
3 810 1.0 1
4 830 2.0 1
5 1120 0.0 1
6 810 0.0 2
7 830 0.0 2
8 1120 4.0 2
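The same per-group counting can be done without an explicit Python loop by comparing each group's NaN count against its size via transform. A sketch, reusing this answer's example frame:

```python
import pandas as pd

df = pd.DataFrame({
    'category': [810, 830, 1120, 810, 830, 1120, 810, 830, 1120],
    'value': [None, None, None, 1, 2, None, None, None, 4],
    'time_idx': [0, 0, 0, 1, 1, 1, 2, 2, 2],
})

# Per row: NaN count and size of this row's time_idx group
nan_counts = df['value'].isna().groupby(df['time_idx']).transform('sum')
group_sizes = df.groupby('time_idx')['value'].transform('size')

# Whole group NaN -> -1; any NaN still left was a lone one -> 0
df['value'] = df['value'].mask(nan_counts == group_sizes, -1).fillna(0)
```

mask writes -1 into every row of an all-NaN group (they are all NaN anyway), and fillna then only touches the lone NaNs that remain.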