
How to remove erroneous data from a data frame in R

I know this is a VERY general title, but bear with me: this is more about data manipulation than data cleaning.

My data set is 1-minute precipitation data.

Allow me to set up some dummy data:

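# Note: c() mixes strings and numbers, so matrix() stores everything as text;
# convert day1 and day2 with as.numeric(as.character(...)) before comparing them.
a<-data.frame(matrix(c("00:00", "00:01","00:02", "00:03", 
"00:04","00:05","00:06","00:07","00:08","00:09","00:10",
"00:11","00:12", 1.2, 1.4 ,1.4, 1.5, 0.7, 0.8, 0.69, 1.2, 
1.0, 1.3, 0.6, 0.2, 0, 0,0, 0 , 0, 0 , 0 , 96.6, 0 , 0 , 
0 , 0, 0 ,0),ncol=3))

names(a)<-c("time","day1","day2")
# The strings hold only hours and minutes, so the format must be "%H:%M"
# (the date part then defaults to the current day).
a$time<-as.POSIXct(a$time, format="%H:%M")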

The data frame now looks like this:

                  time day1 day2
1  2018-06-06 00:00:00  1.2    0
2  2018-06-06 00:01:00  1.4    0
3  2018-06-06 00:02:00  1.4    0
4  2018-06-06 00:03:00  1.5    0
5  2018-06-06 00:04:00  0.7    0
6  2018-06-06 00:05:00  0.8    0
7  2018-06-06 00:06:00 0.69 96.6
8  2018-06-06 00:07:00  1.2    0
9  2018-06-06 00:08:00    1    0
10 2018-06-06 00:09:00  1.3    0
11 2018-06-06 00:10:00  0.6    0
12 2018-06-06 00:11:00  0.2    0
13 2018-06-06 00:12:00    0    0

There is an odd value of 96.6 in there, and I'd like to remove it.

I can't use an outlier method because this is a rainfall data set. A value of 96.6 mm would be plausible if the adjacent rows showed similar or close numbers, as in day1, but it cannot rain 96.6 mm in just 1 minute, so this value is probably an error.

But how do I instruct the computer to read the adjacent rows and, if there are over 10 rows of 0, remove any values > 50 mm?

Note: the average rainfall per minute is only about 1-2 mm.

Addressing your specific question: "But how do I instruct the computer to read the adjacent rows, and if there are over 10 rows of 0, then remove any values > 50 mm?" For my answer, I am only looking at the previous 5 rows. I also didn't remove the values; I set them to 0, but you can set them to NA instead if you need to.

Data

a<-data.frame( time = c("00:00", "00:01","00:02", "00:03", 
                       "00:04","00:05","00:06","00:07","00:08","00:09","00:10",
                       "00:11","00:12","00:13","00:14","00:15"),
               day1 = c(1.2, 1.4 ,1.4, 1.5, 0.7, 0.8, 0.69, 1.2, 
                       1.0, 1.3, 0.6, 0.2, 0, 0, 0, 0),
               day2 = c(0,0, 0 , 0, 0 , 0 , 96.6, 0 , 0 , 
                       0 , 0, 0 ,0, 60, 30, 600))
# Convert the time strings so the data frame prints as shown below
a$time<-as.POSIXct(a$time, format="%H:%M")

                  time day1 day2
1  2018-06-06 00:00:00 1.20  0.0
2  2018-06-06 00:01:00 1.40  0.0
3  2018-06-06 00:02:00 1.40  0.0
4  2018-06-06 00:03:00 1.50  0.0
5  2018-06-06 00:04:00 0.70  0.0
6  2018-06-06 00:05:00 0.80  0.0
7  2018-06-06 00:06:00 0.69 96.6
8  2018-06-06 00:07:00 1.20  0.0
9  2018-06-06 00:08:00 1.00  0.0
10 2018-06-06 00:09:00 1.30  0.0
11 2018-06-06 00:10:00 0.60  0.0
12 2018-06-06 00:11:00 0.20  0.0
13 2018-06-06 00:12:00 0.00  0.0
14 2018-06-06 00:13:00 0.00 60.0
15 2018-06-06 00:14:00 0.00 30.0
16 2018-06-06 00:15:00 0.00 600.0

I added a few data points at the end to see what would happen if there were two errors in a row (or two that were close together).

Solution

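library(dplyr)      # provides %>%, transmute() and lag()
library(RcppRoll)   # provides roll_sumr()
# Set day2 to 0 when it exceeds 50 and the previous 5 values sum to 0
a %>% 
  transmute(time, day1, day2 = ifelse(lag(roll_sumr(day2, 5)) == 0 & day2 > 50, 0, day2))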

Output

                  time day1 day2
1  2018-06-06 00:00:00 1.20    0
2  2018-06-06 00:01:00 1.40    0
3  2018-06-06 00:02:00 1.40    0
4  2018-06-06 00:03:00 1.50    0
5  2018-06-06 00:04:00 0.70    0
6  2018-06-06 00:05:00 0.80    0
7  2018-06-06 00:06:00 0.69    0
8  2018-06-06 00:07:00 1.20    0
9  2018-06-06 00:08:00 1.00    0
10 2018-06-06 00:09:00 1.30    0
11 2018-06-06 00:10:00 0.60    0
12 2018-06-06 00:11:00 0.20    0
13 2018-06-06 00:12:00 0.00    0
14 2018-06-06 00:13:00 0.00    0
15 2018-06-06 00:14:00 0.00   30
16 2018-06-06 00:15:00 0.00  600
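
The same rule can also be written without extra packages. The sketch below is a hypothetical base R helper (the name flag_isolated_spikes and its arguments window and limit are illustrative, not part of the original answer). It sets a value to NA, rather than 0, when it exceeds limit and the window values before it are all zero. Raising window to 10 would be closer to the rule described in the question.

```r
# Hypothetical helper (not from the original answer): flag a value as an
# error when it exceeds `limit` and the `window` values before it are all 0.
flag_isolated_spikes <- function(x, window = 5, limit = 50) {
  out <- x
  for (i in seq_along(x)) {
    if (i <= window) next                    # not enough history yet
    prev <- x[(i - window):(i - 1)]          # the preceding `window` values
    # isTRUE() keeps missing values from breaking the condition
    if (isTRUE(all(prev == 0) && x[i] > limit)) out[i] <- NA
  }
  out
}

day2 <- c(0, 0, 0, 0, 0, 0, 96.6, 0, 0, 0, 0, 0, 0, 60, 30, 600)
cleaned <- flag_isolated_spikes(day2, window = 5, limit = 50)
which(is.na(cleaned))   # 7 14
```

As with the dplyr version, the 30 and the 600 are kept because the values just before them are not all zero.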

If you want to base the check on some sort of rolling distribution, there are some things to consider, but you could code it with something like this:

a %>% 
  transmute(time, day1, 
            day2 = ifelse(day2 > 3*lag(roll_sdr(day2, 5)) & !is.na(lag(roll_sdr(day2, 5))), 
                          lag(roll_meanr(day2, 5)), 
                          day2))

Output

                  time day1 day2
1  2018-06-06 00:00:00 1.20    0
2  2018-06-06 00:01:00 1.40    0
3  2018-06-06 00:02:00 1.40    0
4  2018-06-06 00:03:00 1.50    0
5  2018-06-06 00:04:00 0.70    0
6  2018-06-06 00:05:00 0.80    0
7  2018-06-06 00:06:00 0.69    0
8  2018-06-06 00:07:00 1.20    0
9  2018-06-06 00:08:00 1.00    0
10 2018-06-06 00:09:00 1.30    0
11 2018-06-06 00:10:00 0.60    0
12 2018-06-06 00:11:00 0.20    0
13 2018-06-06 00:12:00 0.00    0
14 2018-06-06 00:13:00 0.00    0
15 2018-06-06 00:14:00 0.00   30
16 2018-06-06 00:15:00 0.00   18

You can see that it finds the erroneous 96.6 and replaces it with the mean of the previous 5 values (which is 0). The 60 in day2 is handled the same way. The 30 is not changed because it does not exceed 3 times the standard deviation of the previous 5 values. The 600 does exceed 3 times the standard deviation of the previous 5 values, so it is replaced by their mean (18). You may need to tweak or iterate this procedure to get what you want.
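
To see where these decisions come from, the arithmetic for the last two rows can be checked by hand. This is only a sketch: the variable names are illustrative, and the windows are taken from the original day2 column (not the cleaned one), which is what the code above uses.

```r
prev_for_30  <- c(0, 0, 0, 0, 60)    # rows 10-14, the five values before the 30
prev_for_600 <- c(0, 0, 0, 60, 30)   # rows 11-15, the five values before the 600

sd(prev_for_30)         # about 26.83
3 * sd(prev_for_30)     # about 80.5, so the 30 is kept
3 * sd(prev_for_600)    # also about 80.5, so the 600 is replaced
mean(prev_for_600)      # 18, the replacement value
```

Because the window still contains the uncorrected 60, the replacement for 600 is 18 rather than 0, which is one reason a second pass may be needed.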

You can make use of diff in base R. Define a function with a threshold and use which to find the errors that should be removed. The rows are not deleted; instead, each erroneous value is replaced by the value before it.

flattenSpikes <- function(x, threshold) {
  diffprev <- diff(x)
  x[which(diffprev > threshold) + 1] <- x[which(diffprev > threshold)]
  return(x)
}

a[,-1] <- mapply(flattenSpikes, a[,-1], 50)

a
#    time                day1    day2
# 1  2018-06-06 00:00:00 1.20    0
# 2  2018-06-06 00:01:00 1.40    0
# 3  2018-06-06 00:02:00 1.40    0
# 4  2018-06-06 00:03:00 1.50    0
# 5  2018-06-06 00:04:00 0.70    0
# 6  2018-06-06 00:05:00 0.80    0
# 7  2018-06-06 00:06:00 0.69    0
# 8  2018-06-06 00:07:00 1.20    0
# 9  2018-06-06 00:08:00 1.00    0
# 10 2018-06-06 00:09:00 1.30    0
# 11 2018-06-06 00:10:00 0.60    0
# 12 2018-06-06 00:11:00 0.20    0
# 13 2018-06-06 00:12:00 0.00    0
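
To make the intermediate steps visible, here is a small sketch that repeats the function on the day2 column alone and then on the longer series from the first answer. The variable names d and day2_long are illustrative only.

```r
# flattenSpikes as defined in the answer above
flattenSpikes <- function(x, threshold) {
  diffprev <- diff(x)
  x[which(diffprev > threshold) + 1] <- x[which(diffprev > threshold)]
  return(x)
}

day2 <- c(0, 0, 0, 0, 0, 0, 96.6, 0, 0, 0, 0, 0, 0)
d <- diff(day2)            # d[i] is day2[i + 1] - day2[i]
which(d > 50)              # 6: the jump from row 6 to row 7
flattenSpikes(day2, 50)    # row 7 takes the value of row 6, i.e. 0

# Longer series from the first answer, with several large values at the end
day2_long <- c(day2, 60, 30, 600)
tail(flattenSpikes(day2_long, 50), 3)   # 0 30 30
```

Only upward jumps larger than the threshold are replaced, and each one simply copies its predecessor, so the 600 becomes 30 here rather than 0.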

Data

a <- structure(list(time = c("00:00", "00:01", "00:02", "00:03", "00:04",
                             "00:05", "00:06", "00:07", "00:08", "00:09",
                             "00:10", "00:11", "00:12"),
                    day1 = c(1.2, 1.4, 1.4, 1.5, 0.7, 0.8, 0.69, 1.2, 1,
                             1.3, 0.6, 0.2, 0),
                    day2 = c(0, 0, 0, 0, 0, 0, 96.6, 0, 0, 0, 0, 0, 0)),
               .Names = c("time", "day1", "day2"),
               row.names = c(NA, -13L), class = "data.frame")

a$time<-as.POSIXct(a$time, format="%H:%M")
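
Note that with only "%H:%M" in the format, R fills in the current date, which is why the printed tables show 2018-06-06. If the date matters, it can be pinned explicitly. The following is a minimal sketch; the date 2018-06-06 and the UTC time zone are assumptions for illustration.

```r
times <- c("00:00", "00:06", "00:12")

# With only "%H:%M", the date part is filled in with the current date
as.POSIXct(times, format = "%H:%M")

# Assumption for illustration: the readings belong to 2018-06-06, in UTC
fixed <- as.POSIXct(paste("2018-06-06", times),
                    format = "%Y-%m-%d %H:%M", tz = "UTC")
format(fixed, "%Y-%m-%d %H:%M")
```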
