
How can I replace the outlier values in each row of a DataFrame with NaN?

I tried the following code to replace the outlier values in each row with NaN, but it returns a DataFrame made up entirely of NaN. What am I doing wrong?

The code I'm trying:

The result should be this:

Can anyone help?

import pandas as pd

data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0, 0, 0],
        ['ANJSHD15', 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        ['ANJSHD17', 19, 18, 22, 19, 25, 18, 23, 22, 22, 17, 16, 19]]
df = pd.DataFrame(data, columns=['MATRÍCULA', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12'])
df

range = list(df.columns.values)[1:13]

q1 = df[range].quantile(0.25, axis=1)
q3 = df[range].quantile(0.75, axis=1)
iqr = q3-q1 #interquartile range
min  = df.min(axis=1)
max = q3+3*iqr
df_filtered = df[(df[range] > min) & (df[range] < max)]
df_filtered
data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0,0, 0],
        ['ANJSHD15',10, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0], 
        ['ANJSHD17',19, 18, 22, 19, 25, 18, 23, 22, 22, 17,16, 19]]
columns = ['MATRÍCULA','V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12']
df = pd.DataFrame(data, columns=columns).set_index('MATRÍCULA')
df

Here I set the 'MATRÍCULA' column as the index with .set_index('MATRÍCULA').

This way, you won't have to select all the other columns every time. Alternatively, without setting the index, you could create a view of the value columns and work with that: only_values_df = df.iloc[:, 1:].


Here, min_vals and max_vals are converted to NumPy arrays (reshaped into a single column) so they can be compared with the DataFrame.

If I left them as Series, I would get this warning:

FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version.

More importantly, it would produce a wrong result in which every value of the comparison is False. This was the source of your problem.

# 'MATRÍCULA' is now the index, so every remaining column is a value column
q1 = df.quantile(0.25, axis=1)
q3 = df.quantile(0.75, axis=1)
iqr = q3 - q1  # interquartile range

min_vals = df.min(axis=1).to_numpy().reshape(-1,1)
max_vals = (q3 + 3*iqr).to_numpy().reshape(-1,1)

Use >= and <= instead of > and <:

df_filtered = df[(df >= min_vals) & (df <= max_vals)]
df_filtered
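Putting the pieces of this answer together, a minimal end-to-end sketch might look like the following. It assumes the same sample data as the question; the name mask and the use of df.where instead of boolean indexing are choices made for this illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

data = [
    ["ANJSHD12", 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0, 0, 0],
    ["ANJSHD15", 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ["ANJSHD17", 19, 18, 22, 19, 25, 18, 23, 22, 22, 17, 16, 19],
]
columns = ["MATRÍCULA"] + [f"V{i}" for i in range(1, 13)]
df = pd.DataFrame(data, columns=columns).set_index("MATRÍCULA")

# Per-row quartiles (axis=1 computes across the columns of each row)
q1 = df.quantile(0.25, axis=1)
q3 = df.quantile(0.75, axis=1)
iqr = q3 - q1

# Column vectors of shape (n_rows, 1), so each row is compared with its own bounds
min_vals = df.min(axis=1).to_numpy().reshape(-1, 1)
max_vals = (q3 + 3 * iqr).to_numpy().reshape(-1, 1)

# Illustrative name: True where a value lies inside its row's bounds
mask = (df >= min_vals) & (df <= max_vals)

# Values outside the bounds become NaN
df_filtered = df.where(mask)
print(df_filtered)
```

With the sample data, the upper bound of the first row works out to 164, so 99992 and 99999 are masked while 140 is kept.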

This works for me:

df_filtered = df[range][df[range].ge(min, axis=0) & df[range].le(max, axis=0)]

If you run df[range] > min or df[range] < max, you will see that the result is a DataFrame containing only False. False & False = False, so no element is taken from df and you get a DataFrame with only NaN values. I'm not sure why that is the case (comparing a DataFrame with a Series must work like that; I'm not a pandas expert). Instead, use the le and ge methods (you need non-strict inequalities because you want to keep values like 0).

By the way, try not to use variable names like range, min, max, or other built-in functions or keywords.
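The behavior described here comes from label alignment: when a DataFrame is compared with a Series using an operator, pandas matches the Series index against the DataFrame's column labels. Passing axis=0 to le or ge matches it against the row labels instead. A small sketch on a toy frame (the names toy, row_limit and within are made up for this illustration):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 5], "b": [3, 2]}, index=["r1", "r2"])

# One limit per row, indexed by the row labels
row_limit = pd.Series([2, 4], index=["r1", "r2"])

# axis=0 aligns the Series index with the row labels,
# so every value in a row is compared with that row's limit
within = toy.le(row_limit, axis=0)

# Keep values within the limit, replace the others with NaN
result = toy.where(within)
print(within)
print(result)
```

A plain toy <= row_limit would instead try to match the labels r1 and r2 against the columns a and b, which have nothing in common.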

Use apply and where.

Your df:

>>> df
  MATRÍCULA   V1  V2     V3  V4  V5  V6  V7  V8     V9  V10  V11  V12
0  ANJSHD12  140   8  99992   0   0   0   0   1  99999    0    0    0
1  ANJSHD15   10   0      0   0   0   0   0   0      0    0    0    0
2  ANJSHD17   19  18     22  19  25  18  23  22     22   17   16   19

Use where inside a column-wise apply and reassign the result to your range of columns (renamed with a trailing underscore to avoid overwriting the builtin):

df[range_] = df[range_].apply(lambda col: col.where((col>=min_)&(col<=max_)))

Output:

>>> df
      V1  V2    V3  V4  V5  V6  V7  V8    V9  V10  V11  V12
0  140.0   8   NaN   0   0   0   0   1   NaN    0    0    0
1    NaN   0   0.0   0   0   0   0   0   0.0    0    0    0
2   19.0  18  22.0  19  25  18  23  22  22.0   17   16   19
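For reference, here is a self-contained sketch of this approach. The one-liner above does not show how range_, min_ and max_ were built, so the definitions below are assumptions based on the question's code, with the minimum taken over the value columns only.

```python
import pandas as pd

data = [
    ["ANJSHD12", 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0, 0, 0],
    ["ANJSHD15", 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ["ANJSHD17", 19, 18, 22, 19, 25, 18, 23, 22, 22, 17, 16, 19],
]
columns = ["MATRÍCULA"] + [f"V{i}" for i in range(1, 13)]
df = pd.DataFrame(data, columns=columns)

# Assumed definitions, mirroring the question with underscore-suffixed names
range_ = list(df.columns[1:13])
q1 = df[range_].quantile(0.25, axis=1)
q3 = df[range_].quantile(0.75, axis=1)
min_ = df[range_].min(axis=1)
max_ = q3 + 3 * (q3 - q1)

# Each column is a Series indexed by row label, exactly like min_ and max_,
# so the comparison lines up row by row without any reshaping
df[range_] = df[range_].apply(lambda col: col.where((col >= min_) & (col <= max_)))
print(df)
```

Because apply passes one column at a time, the Series-to-Series comparison aligns on the row index, which sidesteps the DataFrame-versus-Series alignment problem.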

I think you'll benefit from choosing a different interpolation method for your quantile calculation.

# You overcomplicate how this works:
cols = df.columns[1:13]

# Note how I put `0.75` in `[]`, this changes how the output is returned.
q3 = df[cols].quantile([0.75], axis=1, interpolation='higher')

# Less than or equal to...
mask = (df[cols].le(q3.T.values)
        # There's no reason to have this next line if you're just going to use the min...
        & df[cols].ge(df[cols].min(axis=1).to_frame().values))

# Overwrite outlier values:
df[cols] = df[cols][mask]
print(df)

Output:

  MATRÍCULA     V1  V2    V3  V4   V5  V6   V7  V8    V9  V10  V11  V12
0  ANJSHD12  140.0   8   NaN   0  0.0   0  0.0   1   NaN    0    0    0
1  ANJSHD15    NaN   0   0.0   0  0.0   0  0.0   0   0.0    0    0    0
2  ANJSHD17   19.0  18  22.0  19  NaN  18  NaN  22  22.0   17   16   19
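To see what the interpolation argument changes, it may help to compare the two methods on the first row of the sample data. This is only an illustrative sketch; the names row, q3_linear and q3_higher are not from the answer.

```python
import pandas as pd

# Values V1..V12 of the first sample row
row = pd.Series([140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0, 0, 0])

# Default 'linear': interpolates between the two neighbouring sorted values (8 and 140)
q3_linear = row.quantile(0.75)

# 'higher': returns the larger of the two neighbours, an actual data value
q3_higher = row.quantile(0.75, interpolation="higher")

print(q3_linear, q3_higher)
```

With 'higher', the third quartile of this row is 140 itself, which is why 140 survives a plain "less than or equal to q3" filter while 99992 and 99999 do not.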

Assuming the 'MATRÍCULA' values are unique, we can simplify all of this with a groupby:

df[cols] = (df.groupby('MATRÍCULA')[cols]
              .apply(lambda x: x[x.le(x.quantile([0.75], axis=1, interpolation='higher').values)]))
print(df)

# Output:
# Same as Above!
