
How to replace repeated NaNs with a different value from lone NaNs in a Pandas data frame

I have several time series arranged in a data frame, similar to the one below:


   category value   time_idx
0   810     0.118794    0
1   830     0.552947    0
2   1120    0.133193    0
3   1370    0.840183    0
4   810     0.129385    1
... ... ... ...
6095 1370   0.157391    1523
6096 810    0.141377    1524
6097 830    0.212254    1524
6098 1120   0.069970    1524
6099 1370   0.134947    1524

Some values are NaN. What I would like is to replace any NaN values that are NOT repeated with 0, as I am assuming the value is 0 for that category at that time. However, whenever every single category has a value of NaN at the same time (i.e. at the same time_idx), I want to replace every value at that time with -1.

Just replacing the NaNs with a value is of course trivial in Pandas, but the added complexity of specifically replacing NaNs that are NaN for every category at a given time has stumped me. I know I can just loop through the time indices, but my actual datasets will have over 900 categories, so I would like to find a more efficient Pandas-esque method.

The only thing I could think of was a list comprehension, which I don't think is necessarily more efficient than an explicit loop anyway, plus I couldn't come up with one that worked properly.

I know that I can just replace all NaNs like so:

data["value"] = data["value"].replace(np.nan, 0)

but I'm not sure how to implement this in my case, where I only want to replace lone NaNs with 0. This is the loop I have currently:

import numpy as np

num_channels = data["category"].nunique()
nan_vals = data[lambda x: np.isnan(x.value)]
nan_times = nan_vals["time_idx"]

for time in nan_times:
    # Rows that are NaN at this time
    index = nan_vals[lambda x: x.time_idx == time].index

    if nan_vals[lambda x: x.time_idx == time]["category"].nunique() < num_channels:
        # Only some channels are NaN at this time: set those to 0
        data.loc[index, "value"] = 0
    else:
        # Every channel is NaN at this time: set them all to -1
        data.loc[index, "value"] = -1

Any ideas are appreciated.

Here is an example:

Given the following data frame:

    category    value   time_idx
0   810          NaN    0
1   830          NaN    0
2   1120         NaN    0
3   1370         NaN    0
4   810      0.129385   1
5   830          NaN    1
6   1120     0.144378   1
7   1370         NaN    1
8   810      0.124334   2
9   830      0.487274   2
10  1120     0.119153   2
11  1370     0.871687   2

I would like this output:

    category    value   time_idx
0   810        -1.000000    0
1   830        -1.000000    0
2   1120       -1.000000    0
3   1370       -1.000000    0
4   810         0.129385    1
5   830         0.000000    1
6   1120        0.144378    1
7   1370        0.000000    1
8   810         0.124334    2
9   830         0.487274    2
10  1120        0.119153    2
11  1370        0.871687    2

In this example, at time = 0 every category's value was NaN, so they would all be replaced with -1. At time = 1, there were non-NaN values, so any NaN values present (categories 830 and 1370) would be replaced with 0.

You can find the time_idx values where all entries are NaN using groupby and then group.isna().all(). You can use that mask to fill the NaNs with -1.

Afterwards, fill all other NaNs with 0 using fillna.

all_nas = df.groupby("time_idx").value.apply(lambda group: group.isna().all())
df = df.set_index("time_idx")
df.loc[all_nas, "value"] = -1
df = df.reset_index().fillna(0)
print(df)

#     time_idx  category     value
# 0          0       810 -1.000000
# 1          0       830 -1.000000
# 2          0      1120 -1.000000
# 3          0      1370 -1.000000
# 4          1       810  0.129385
# 5          1       830  0.000000
# 6          1      1120  0.144378
# 7          1      1370  0.000000
# 8          2       810  0.124334
# 9          2       830  0.487274
# 10         2      1120  0.119153
# 11         2      1370  0.871687
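
As a possible variation on the approach above, the mask can be built already aligned to the rows with groupby(...).transform("all"), which avoids the set_index / reset_index round trip and leaves the column order unchanged. This is only a sketch, assuming the example frame from the question; the names is_nan and all_nan_at_time are made up for illustration.

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame(
    {
        "category": [810, 830, 1120, 1370] * 3,
        "value": [
            np.nan, np.nan, np.nan, np.nan,
            0.129385, np.nan, 0.144378, np.nan,
            0.124334, 0.487274, 0.119153, 0.871687,
        ],
        "time_idx": [0] * 4 + [1] * 4 + [2] * 4,
    }
)

is_nan = df["value"].isna()

# Row-aligned mask: True on every row whose time_idx group contains only NaNs
all_nan_at_time = is_nan.groupby(df["time_idx"]).transform("all")

# Times where every category is NaN get -1 ...
df.loc[all_nan_at_time, "value"] = -1
# ... and the remaining (lone) NaNs get 0
df["value"] = df["value"].fillna(0)

print(df)
```

Filling only the value column, rather than calling fillna(0) on the whole frame, keeps NaNs in any other columns untouched, which may or may not be what you want.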

You can group by time_idx and iterate over the groups. In each group, count the number of NaN values in the value column, then update the value column depending on that count.


import pandas as pd

df = pd.DataFrame(
    {
        'category': [810, 830, 1120, 810, 830, 1120, 810, 830, 1120],
        'value': [None, None, None, 1, 2, None, None, None, 4],
        'time_idx': [0, 0, 0, 1, 1, 1, 2, 2, 2],
    }
)

print(df, end='\n\n')


for name, group in df.copy().groupby('time_idx'):
    num_nans = group['value'].isnull().sum()
    mask = (df['time_idx'] == name) & df['value'].isna()
    if num_nans == len(group):
        df.loc[mask, 'value'] = -1
    else:
        df.loc[mask, 'value'] = 0

print(df)

Output:

   category  value  time_idx
0       810    NaN         0
1       830    NaN         0
2      1120    NaN         0
3       810    1.0         1
4       830    2.0         1
5      1120    NaN         1
6       810    NaN         2
7       830    NaN         2
8      1120    4.0         2

   category  value  time_idx
0       810   -1.0         0
1       830   -1.0         0
2      1120   -1.0         0
3       810    1.0         1
4       830    2.0         1
5      1120    0.0         1
6       810    0.0         2
7       830    0.0         2
8      1120    4.0         2
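
The same counting idea can probably be expressed without a Python-level loop, which may matter with many time indices. The sketch below is one way to do it, assuming the same small frame as above: it compares the number of NaNs per time_idx against the number of rows per time_idx. The helper names (nans_per_time, rows_per_time, fill) are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "category": [810, 830, 1120, 810, 830, 1120, 810, 830, 1120],
        "value": [np.nan, np.nan, np.nan, 1, 2, np.nan, np.nan, np.nan, 4],
        "time_idx": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    }
)

is_nan = df["value"].isna()

# Number of NaNs in each row's time_idx group, broadcast back to the rows
nans_per_time = is_nan.astype(int).groupby(df["time_idx"]).transform("sum")
# Number of rows in each row's time_idx group, broadcast back to the rows
rows_per_time = df.groupby("time_idx")["value"].transform("size")

# -1 where the whole group is NaN, 0 otherwise
fill = np.where(nans_per_time == rows_per_time, -1.0, 0.0)

# Keep existing values; use the fill value only where the original was NaN
df["value"] = np.where(is_nan, fill, df["value"])

print(df)
```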
