How can I replace the outlier values in each row of a DataFrame with NaN?
I tried the following code to replace the outlier values in each row with NaN, but it returns a DataFrame filled entirely with NaN. What am I doing wrong?
The result should be this:
Can anyone help?
import pandas as pd
data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0,0, 0],
['ANJSHD15',10, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0],
['ANJSHD17',19, 18, 22, 19, 25, 18, 23, 22, 22, 17,16, 19]]
df = pd.DataFrame(data, columns=['MATRÍCULA','V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12'])
df
range = list(df.columns.values)[1:13]
q1 = df[range].quantile(0.25, axis=1)
q3 = df[range].quantile(0.75, axis=1)
iqr = q3-q1 #interquartile range
min = df.min(axis=1)
max = q3+3*iqr
df_filtered = df[(df[range] > min) & (df[range] < max)]
df_filtered
data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0,0, 0],
['ANJSHD15',10, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0],
['ANJSHD17',19, 18, 22, 19, 25, 18, 23, 22, 22, 17,16, 19]]
columns = ['MATRÍCULA','V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12']
df = pd.DataFrame(data, columns=columns).set_index('MATRÍCULA')
df
Here I am setting the 'MATRÍCULA' column as the index with .set_index('MATRÍCULA').
This way, you won't have to select all the other columns every time. Alternatively, you could keep the default index and work on a slice that skips the first column: only_values_df = df.iloc[:, 1:].
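As a small illustration of the two options just described, here is a sketch on a shortened copy of the question's data (the names raw and indexed are only illustrative). Note that the iloc slice is meant for the frame that still has 'MATRÍCULA' as its first column; applied after set_index it would drop V1 instead.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only.
data = [['ANJSHD12', 140, 8, 99992],
        ['ANJSHD15', 10, 0, 0]]
raw = pd.DataFrame(data, columns=['MATRÍCULA', 'V1', 'V2', 'V3'])

# Option 1: move the identifier into the index; only value columns remain.
indexed = raw.set_index('MATRÍCULA')

# Option 2: keep the default index and slice off the first column by position.
only_values_df = raw.iloc[:, 1:]

print(list(indexed.columns))
print(list(only_values_df.columns))
```

Either way, the row-wise statistics are then computed on the value columns only.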
Here, convert min_vals and max_vals to NumPy arrays for comparison with the DataFrame.
If I left them as Series, I would get this warning:
FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version.
and, more importantly, it would produce a wrong result in which every value of the resulting boolean DataFrame is False. This was the source of your problem.
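cols_range = df.columns # every remaining column is a value column now that 'MATRÍCULA' is the index
q1 = df[cols_range].quantile(0.25, axis=1)
q3 = df[cols_range].quantile(0.75, axis=1)
iqr = q3 - q1 # interquartile range
# reshape(-1, 1) turns each Series into a column vector with one bound per row
min_vals = df.min(axis=1).to_numpy().reshape(-1,1)
max_vals = (q3 + 3*iqr).to_numpy().reshape(-1,1)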
Use >= and <= instead of > and <, so that values equal to a bound (such as the row minimum) are kept:
df_filtered = df[(df >= min_vals) & (df <= max_vals)]
df_filtered
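For convenience, the fragments of this answer can be assembled into one script. This is only a sketch under the answer's own assumptions (row minimum as the lower bound, Q3 + 3 * IQR as the upper bound); the quantiles are taken over the whole frame because 'MATRÍCULA' sits in the index.

```python
import pandas as pd

data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0, 0, 0],
        ['ANJSHD15', 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ['ANJSHD17', 19, 18, 22, 19, 25, 18, 23, 22, 22, 17, 16, 19]]
columns = ['MATRÍCULA'] + [f'V{i}' for i in range(1, 13)]
df = pd.DataFrame(data, columns=columns).set_index('MATRÍCULA')

q1 = df.quantile(0.25, axis=1)
q3 = df.quantile(0.75, axis=1)
iqr = q3 - q1  # interquartile range, one value per row

# Column vectors of shape (3, 1): one bound per row, broadcast across columns.
min_vals = df.min(axis=1).to_numpy().reshape(-1, 1)
max_vals = (q3 + 3 * iqr).to_numpy().reshape(-1, 1)

mask = (df >= min_vals) & (df <= max_vals)
df_filtered = df[mask]  # cells where the mask is False become NaN

print(df_filtered)
```

With the default linear interpolation, the upper bounds work out to 164, 0 and 34 for the three rows, so only 99992 and 99999 in the first row and 10 in the second row are replaced.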
This works for me:
df_filtered = df[range][df[range].ge(min, axis=0) & df[range].le(max, axis=0)]
If you run df[range] > min or df[range] < max, you will see that the output is a DataFrame containing only False. False & False = False, so no element is taken from df and you get a DataFrame with only NaN values. I'm not sure why this is the case (comparing a DataFrame with a Series must work like that; I'm not an expert in pandas). Instead, use the methods le and ge (you need non-strict inequalities because you want to keep values like 0).
By the way, try not to use variable names like range, min, max, or the names of other built-in functions or keywords.
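To make the role of axis=0 concrete, here is a tiny sketch on made-up numbers (the frame and the name upper are illustrative, not taken from the question). As far as I understand, the comparison operators match a Series' labels against the DataFrame's column labels by default, whereas le and ge with axis=0 match them against the row index, which is what a per-row bound needs.

```python
import pandas as pd

# Made-up values for illustration; rows are labelled 'a' and 'b'.
df = pd.DataFrame({'V1': [140, 10], 'V2': [8, 0]}, index=['a', 'b'])

# One upper bound per row, indexed by the same row labels.
upper = pd.Series([100, 5], index=['a', 'b'])

# axis=0 lines the Series up with the row index, so each row is
# compared against its own bound.
by_row = df.le(upper, axis=0)

print(by_row)
```

Row 'a' is compared with 100 and row 'b' with 5, so V1 is False and V2 is True in both rows.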
Use apply and where.
Your df:
>>> df
MATRÍCULA V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0 ANJSHD12 140 8 99992 0 0 0 0 1 99999 0 0 0
1 ANJSHD15 10 0 0 0 0 0 0 0 0 0 0 0
2 ANJSHD17 19 18 22 19 25 18 23 22 22 17 16 19
Use where within apply, column by column, and assign the result back to your range of columns (the names are given a trailing underscore to avoid overwriting the built-ins):
df[range_] = df[range_].apply(lambda col: col.where((col>=min_)&(col<=max_)))
Output:
>>> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0 140.0 8 NaN 0 0 0 0 1 NaN 0 0 0
1 NaN 0 0.0 0 0 0 0 0 0.0 0 0 0
2 19.0 18 22.0 19 25 18 23 22 22.0 17 16 19
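The same apply/where idea can be wrapped in a function so the bounds are computed in one place. This is a sketch only: mask_row_outliers and its k parameter are hypothetical names introduced here, and the bounds (row minimum, Q3 + k * IQR) are the ones used in the question.

```python
import pandas as pd


def mask_row_outliers(frame, cols, k=3.0):
    """Return a copy of `frame` in which values of `cols` outside
    [row minimum, Q3 + k * IQR] are replaced with NaN.

    Hypothetical helper: the name and the `k` parameter are illustrative.
    """
    values = frame[cols]
    q1 = values.quantile(0.25, axis=1)
    q3 = values.quantile(0.75, axis=1)
    lower = values.min(axis=1)
    upper = q3 + k * (q3 - q1)

    result = frame.copy()
    # Each column is a Series indexed by row label, just like lower and
    # upper, so the comparisons line up row by row.
    result[cols] = values.apply(
        lambda col: col.where((col >= lower) & (col <= upper))
    )
    return result


data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0, 0, 0],
        ['ANJSHD15', 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ['ANJSHD17', 19, 18, 22, 19, 25, 18, 23, 22, 22, 17, 16, 19]]
columns = ['MATRÍCULA'] + [f'V{i}' for i in range(1, 13)]
df = pd.DataFrame(data, columns=columns)

range_ = columns[1:]  # the twelve value columns
cleaned = mask_row_outliers(df, range_)
print(cleaned)
```

Because the lower bound is the row minimum itself, only the upper bound ever removes anything; the function returns a copy, so the original df keeps its values.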
I think you'll benefit from choosing a different interpolation method for your quantile calculation.
# You overcomplicate how this works:
cols = df.columns[1:13]
# Note how I put `0.75` in `[]`, this changes how the output is returned.
q3 = df[cols].quantile([0.75], axis=1, interpolation='higher')
# Less than or equal to...
mask = (df[cols].le(q3.T.values)
# There's no reason to have this next line if you're just going to use the min...
& df[cols].ge(df[cols].min(axis=1).to_frame().values))
# Overwrite outlier values:
df[cols] = df[cols][mask]
print(df)
Output:
MATRÍCULA V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0 ANJSHD12 140.0 8 NaN 0 0.0 0 0.0 1 NaN 0 0 0
1 ANJSHD15 NaN 0 0.0 0 0.0 0 0.0 0 0.0 0 0 0
2 ANJSHD17 19.0 18 22.0 19 NaN 18 NaN 22 22.0 17 16 19
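To see what the interpolation argument changes, here is a short sketch that computes the 0.75 quantile of the first row of the question's data under several methods (the dictionary name q3_by_method is illustrative). With twelve values, the 0.75 quantile falls between the ninth and tenth sorted values, 8 and 140, so the method decides which of them, or which blend, is returned.

```python
import pandas as pd

# First row of the question's data (V1..V12).
row = pd.Series([140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0, 0, 0])

methods = ['linear', 'lower', 'higher', 'midpoint', 'nearest']
q3_by_method = {m: row.quantile(0.75, interpolation=m) for m in methods}

for method, value in q3_by_method.items():
    print(method, value)
```

The default 'linear' gives 41.0, while 'higher' gives 140, an actual data point. That is why this answer can use the quantile itself as the upper bound and still keep 140 in the first row.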
Assuming the 'MATRÍCULA' values are unique, we can simplify all of this with a groupby:
df[cols] = (df.groupby('MATRÍCULA')[cols]
.apply(lambda x: x[x.le(x.quantile([0.75], axis=1, interpolation='higher').values)]))
print(df)
# Output:
# Same as Above!