Write a function to loop through each column and flag the outlier in R
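I am new to writing functions in R. I have a dataset (which I created from a larger dataset for practice), and I want to loop through each column and flag the outliers. Any help or suggestions are appreciated. Here is my dataset: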
Time Temperature.C. Relative_Humidity
1 10/24/2022 16:45 32.2 50.0
2 10/24/2022 16:46 30.0 49.0
3 10/24/2022 16:47 31.0 50.0
4 10/24/2022 16:48 30.0 50.5
5 10/24/2022 16:49 30.0 50.0
6 10/24/2022 16:50 31.0 49.0
7 10/24/2022 16:51 32.2 51.0
8 10/24/2022 16:52 86.0 50.5
9 10/24/2022 16:53 30.0 50.0
10 10/24/2022 16:54 30.0 120.0
11 10/24/2022 16:55 30.0 50.0
12 10/24/2022 16:56 86.0 50.0
13 10/24/2022 16:57 30.0 51.0
14 10/24/2022 16:58 31.0 51.0
15 10/24/2022 16:59 31.0 50.0
16 10/24/2022 17:00 31.0 49.0
17 10/24/2022 17:01 3.0 52.0
18 10/24/2022 17:02 32.2 49.0
19 10/24/2022 17:03 30.0 2.0
structure(list(Time = c("10/24/2022 16:45", "10/24/2022 16:46",
"10/24/2022 16:47", "10/24/2022 16:48", "10/24/2022 16:49", "10/24/2022 16:50",
"10/24/2022 16:51", "10/24/2022 16:52", "10/24/2022 16:53", "10/24/2022 16:54",
"10/24/2022 16:55", "10/24/2022 16:56", "10/24/2022 16:57", "10/24/2022 16:58",
"10/24/2022 16:59", "10/24/2022 17:00", "10/24/2022 17:01", "10/24/2022 17:02",
"10/24/2022 17:03"), Temperature.C. = c(32.2, 30, 31, 30, 30,
31, 32.2, 86, 30, 30, 30, 86, 30, 31, 31, 31, 3, 32.2, 30), Relative_Humidity = c(50,
49, 50, 50.5, 50, 49, 51, 50.5, 50, 120, 50, 50, 51, 51, 50,
49, 52, 49, 2)), class = "data.frame", row.names = c(NA, -19L
))
I expect my output to look like this.
To define your outlier limits, see here:
You may want to define outliers as
These conditions fit the example you provided best:
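library(dplyr)

# df is the data frame from the question; flag every column except Time
df %>%
  mutate(across(-Time, ~ case_when(
    . > quantile(., probs = 0.75) + IQR(.) * 1.5 ~ "FLAG",  # above Q3 + 1.5 * IQR
    . < quantile(., probs = 0.05) + IQR(.) * 1.5 ~ "FLAG",  # below 5th percentile + 1.5 * IQR
    TRUE ~ ""
  ), .names = "{col}_outlier")) %>%
  # move the Temperature columns next to Time so each flag follows its values
  relocate(Time, starts_with("Temperature"))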
               Time Temperature.C. Temperature.C._outlier Relative_Humidity Relative_Humidity_outlier
1  10/24/2022 16:45           32.2                                     50.0
2  10/24/2022 16:46           30.0                                     49.0
3  10/24/2022 16:47           31.0                                     50.0
4  10/24/2022 16:48           30.0                                     50.5
5  10/24/2022 16:49           30.0                                     50.0
6  10/24/2022 16:50           31.0                                     49.0
7  10/24/2022 16:51           32.2                                     51.0
8  10/24/2022 16:52           86.0                   FLAG              50.5
9  10/24/2022 16:53           30.0                                     50.0
10 10/24/2022 16:54           30.0                                    120.0                      FLAG
11 10/24/2022 16:55           30.0                                     50.0
12 10/24/2022 16:56           86.0                   FLAG              50.0
13 10/24/2022 16:57           30.0                                     51.0
14 10/24/2022 16:58           31.0                                     51.0
15 10/24/2022 16:59           31.0                                     50.0
16 10/24/2022 17:00           31.0                                     49.0
17 10/24/2022 17:01            3.0                   FLAG              52.0
18 10/24/2022 17:02           32.2                                     49.0
19 10/24/2022 17:03           30.0                                      2.0                      FLAG
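Since the question asks for a function, the same idea can also be sketched in base R. This is only a minimal sketch under assumptions, not the answer's own method: the helper names `is_tukey_outlier()` and `flag_outliers()` are made up for illustration, and the limits follow the common Tukey convention (below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR), which differs slightly from the lower condition used in the pipeline.

```r
# Hypothetical helper: TRUE where a value lies outside Tukey's fences
is_tukey_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])  # k times the interquartile range
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical wrapper: loop over the numeric columns and append one
# "<column>_outlier" column for each of them
flag_outliers <- function(data, k = 1.5) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  for (col in numeric_cols) {
    flagged <- is_tukey_outlier(data[[col]], k)
    data[[paste0(col, "_outlier")]] <- ifelse(flagged, "FLAG", "")
  }
  data
}

# The two measurement columns from the question
df <- data.frame(
  Temperature.C. = c(32.2, 30, 31, 30, 30, 31, 32.2, 86, 30, 30, 30, 86, 30,
                     31, 31, 31, 3, 32.2, 30),
  Relative_Humidity = c(50, 49, 50, 50.5, 50, 49, 51, 50.5, 50, 120, 50, 50,
                        51, 51, 50, 49, 52, 49, 2)
)

flagged_df <- flag_outliers(df)
which(flagged_df$Temperature.C._outlier == "FLAG")     # rows 8, 12, 17
which(flagged_df$Relative_Humidity_outlier == "FLAG")  # rows 10, 19
```

In this sketch the flag columns are appended at the end rather than placed next to their source columns, and character columns such as Time are skipped because only numeric columns are examined.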
This tidyverse example may be useful:
library(tidyverse)
sample_data <- data.frame(
a = sample(1:10, 10, TRUE),
b = sample(1:10, 10, TRUE),
c = sample(1:10, 10, TRUE),
d = sample(1:10, 10, TRUE)
)
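From your description, the outliers appear to be the maximum and minimum values, so we can use the range() function to get the range of each column.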
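outcome <- sample_data |>
  mutate(across(everything(), ~ case_when( # note 1
    .x %in% range(.x) ~ "Flag", # note 2
    TRUE ~ ""
  )))
Note 1: depending on your dataset, you may need to replace everything() with c(var1, var2...).
Note 2: this part marks the outliers as "Flag" and everything else as "". It uses %in% rather than ==, because == would recycle the two values returned by range() along the column and test each row against only one of the two extremes.
names(outcome) <- paste0(c("a", "b", "c", "d"), "_flag")
outcome <- bind_cols(sample_data, outcome)
outcome <- outcome |>
  select(order(colnames(outcome)))
After running outcome, the final result should look like this (sample() is random, so the values differ from run to run):
    a a_flag b b_flag  c c_flag d d_flag
1  10   Flag 1   Flag 10   Flag 7
2   5        9   Flag  5        8   Flag
3   1   Flag 8         6        5
4   7        3         7        8   Flag
5   3        7         8        4
6   4        2         7        4
7   6        5         7        5
8   2        4         5        5
9   1   Flag 2         2   Flag 5
10 10   Flag 5        10   Flag 1   Flag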
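One detail is worth checking when matching values against range(): it returns a length-two vector, c(min, max), so an element-wise == comparison recycles those two values along the column, whereas %in% tests membership in the pair. A small sketch with a made-up vector x (not taken from the post) shows the difference:

```r
x <- c(10, 5, 1, 7, 1, 10)  # made-up values for illustration

extremes <- range(x)  # c(1, 10)

# == recycles c(1, 10): odd positions are compared with the minimum only,
# even positions with the maximum only, so the first 10 is missed
by_equality <- x == extremes
by_equality
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

# %in% asks whether each value equals the minimum or the maximum
by_membership <- x %in% extremes
by_membership
# [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE
```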
I hope this example helps with your case.