[英]How to replace repeated NaNs with a different value from lone NaNs from Pandas data frame
I have several time series arranged in a data frame, similar to the one below:
category value time_idx
0 810 0.118794 0
1 830 0.552947 0
2 1120 0.133193 0
3 1370 0.840183 0
4 810 0.129385 1
... ... ... ...
6095 1370 0.157391 1523
6096 810 0.141377 1524
6097 830 0.212254 1524
6098 1120 0.069970 1524
6099 1370 0.134947 1524
Some values are NaN. What I would like is to replace any NaN values that are NOT repeated with 0, as I am assuming the value is 0 for that category at that time. However, any time that every single category has a value of NaN at that same time (i.e. at the same time_idx), I want to replace every value with -1.
Just replacing the NaNs with a value is of course trivial in Pandas, but the added complexity of specifically replacing NaNs that are NaN for every category at a given time has stumped me. I know I can just loop through the time indices, but my actual datasets will have over 900 categories, so I would like to find a more efficient, Pandas-esque method.
The only thing I could think of was a list comprehension, which I don't think is necessarily more efficient than an explicit loop anyway, and I couldn't come up with one that worked properly.
I know that I can replace all NaNs like so:
data["value"] = data["value"].replace(np.nan, 0)
but I'm not sure how to implement this in my case, where I only want to replace lone NaNs with 0. This is the loop I have currently:
num_channels = data["category"].nunique()
nan_vals = data[lambda x: np.isnan(x.value)]
nan_times = nan_vals["time_idx"]
for time in nan_times:
    if nan_vals[lambda x: x.time_idx == time]["category"].nunique() < num_channels:
        # Set 0 for every channel that has NaN at time t
        index = nan_vals[lambda x: x.time_idx == time].index
        data.loc[index, "value"] = data.loc[index, "value"].replace(np.nan, 0)
    else:
        index = nan_vals[lambda x: x.time_idx == time].index
        data.loc[index, "value"] = data[lambda x: x.time_idx == time]["value"].replace(np.nan, -1)
Any ideas are appreciated.
Here is an example. Given the following data frame:
category value time_idx
0 810 NaN 0
1 830 NaN 0
2 1120 NaN 0
3 1370 NaN 0
4 810 0.129385 1
5 830 NaN 1
6 1120 0.144378 1
7 1370 NaN 1
8 810 0.124334 2
9 830 0.487274 2
10 1120 0.119153 2
11 1370 0.871687 2
I would like this output:
category value time_idx
0 810 -1.000000 0
1 830 -1.000000 0
2 1120 -1.000000 0
3 1370 -1.000000 0
4 810 0.129385 1
5 830 0.000000 1
6 1120 0.144378 1
7 1370 0.000000 1
8 810 0.124334 2
9 830 0.487274 2
10 1120 0.119153 2
11 1370 0.871687 2
In this example, at time = 0 every category's value was NaN, so they would all be replaced with -1. At time = 1, there were non-NaN values, so any NaN values present (categories 830 and 1370) would be replaced with 0.
You can find those time_idx where all entries are NaN using groupby and then group.isna().all(). You can use that mask to fill those NaNs with -1. Afterwards, fill all other NaNs with 0 using fillna.
all_nas = df.groupby("time_idx").value.apply(lambda group: group.isna().all())
df = df.set_index("time_idx")
df.loc[all_nas, "value"] = -1
df = df.reset_index().fillna(0)
print(df)
# time_idx category value
# 0 0 810 -1.000000
# 1 0 830 -1.000000
# 2 0 1120 -1.000000
# 3 0 1370 -1.000000
# 4 1 810 0.129385
# 5 1 830 0.000000
# 6 1 1120 0.144378
# 7 1 1370 0.000000
# 8 2 810 0.124334
# 9 2 830 0.487274
# 10 2 1120 0.119153
# 11 2 1370 0.871687
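The same mask can also be broadcast directly back onto the rows with transform, avoiding the set_index/reset_index round trip. A sketch, reusing the example frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": [810, 830, 1120, 1370] * 3,
    "value": [np.nan, np.nan, np.nan, np.nan,
              0.129385, np.nan, 0.144378, np.nan,
              0.124334, 0.487274, 0.119153, 0.871687],
    "time_idx": [0] * 4 + [1] * 4 + [2] * 4,
})

# Per row: is every value in this row's time_idx group NaN?
all_nan = df["value"].isna().groupby(df["time_idx"]).transform("all")

df.loc[all_nan, "value"] = -1        # whole group NaN -> -1
df["value"] = df["value"].fillna(0)  # remaining lone NaNs -> 0
```

Because transform returns a boolean Series aligned with the original index, it can be used with df.loc directly, with no index manipulation needed.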
You can group by time_idx and iterate over the groups. Then, in each group, count the number of NaN values in the value column. Depending on that count, update the value column.
import pandas as pd

df = pd.DataFrame(
    {
        'category': [810, 830, 1120, 810, 830, 1120, 810, 830, 1120],
        'value': [None, None, None, 1, 2, None, None, None, 4],
        'time_idx': [0, 0, 0, 1, 1, 1, 2, 2, 2],
    }
)
print(df, end='\n\n')

for name, group in df.copy().groupby('time_idx'):
    num_nans = group['value'].isnull().sum()
    mask = (df['time_idx'] == name) & df['value'].isna()
    if num_nans == len(group):
        df.loc[mask, 'value'] = -1
    else:
        df.loc[mask, 'value'] = 0
print(df)
Output
category value time_idx
0 810 NaN 0
1 830 NaN 0
2 1120 NaN 0
3 810 1.0 1
4 830 2.0 1
5 1120 NaN 1
6 810 NaN 2
7 830 NaN 2
8 1120 4.0 2
category value time_idx
0 810 -1.0 0
1 830 -1.0 0
2 1120 -1.0 0
3 810 1.0 1
4 830 2.0 1
5 1120 0.0 1
6 810 0.0 2
7 830 0.0 2
8 1120 4.0 2
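The same per-group counting can be done without an explicit Python loop by comparing each group's NaN count against its size via transform. A sketch, reusing this answer's example frame:

```python
import pandas as pd

df = pd.DataFrame({
    'category': [810, 830, 1120, 810, 830, 1120, 810, 830, 1120],
    'value': [None, None, None, 1, 2, None, None, None, 4],
    'time_idx': [0, 0, 0, 1, 1, 1, 2, 2, 2],
})

# Per row: NaN count and size of this row's time_idx group
nan_counts = df['value'].isna().groupby(df['time_idx']).transform('sum')
group_sizes = df.groupby('time_idx')['value'].transform('size')

# Whole group NaN -> -1; any NaN still left was a lone one -> 0
df['value'] = df['value'].mask(nan_counts == group_sizes, -1).fillna(0)
```

mask writes -1 into every row of an all-NaN group (they are all NaN anyway), and fillna then only touches the lone NaNs that remain.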