
R - clean up data based on preceding and following values

I have a table that is later divided into multiple intervals based on several conditions. In some rare cases, I have one or more rows that do not fall into the defined interval, so I'd like to perform some extra clean-up on the data.

For each group (name, location), if the row value in stop == 0, I need to count how many such consecutive rows are in the interval. If that count is less than 3, I need to check how many consecutive rows are marked as stop == 1 directly above and below the zero-valued interval. If the count of values with stop == 1 above & below == 1, then I need to change the zeros in that interval to 1.

I hope the picture will make it more clear:

[image]

df <- read.table(text="name location    stop
John    London  1
John    London  1
John    London  1
John    London  1
John    London  1
John    London  1
John    London  1
John    London  0
John    London  0
John    London  1
John    London  1
John    London  1
John    London  1
John    London  1
John    London  1
John    London  0
John    New_York    0
John    New_York    0
John    New_York    0
John    New_York    1
John    New_York    0
",header  = TRUE, stringsAsFactors = FALSE)

You could iterate over the rows, but it seems that all you want to do is replace every instance of 101 with 111 and of 1001 with 1111 in stop. You can do this by collapsing the stop column into a single string and then making the substitutions with gsub():

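# Collapse the stop column into one string of 0s and 1s
stopString <- paste0(df$stop, collapse = "")
# Fill single-zero and double-zero gaps that are flanked by ones
stopString <- gsub("101", "111", stopString)
stopString <- gsub("1001", "1111", stopString)
# Split back into single characters and convert to numeric
df$stop <- as.numeric(unlist(strsplit(stopString, "")))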
> df
   name location stop
1  John   London    1
2  John   London    1
3  John   London    1
4  John   London    1
5  John   London    1
6  John   London    1
7  John   London    1
8  John   London    1
9  John   London    1
10 John   London    1
11 John   London    1
12 John   London    1
13 John   London    1
14 John   London    1
15 John   London    1
16 John   London    0
17 John New_York    0
18 John New_York    0
19 John New_York    0
20 John New_York    1
21 John New_York    0
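One caveat with the plain patterns above: gsub() matches do not overlap, so in a sequence such as 10101 the middle 1 is consumed by the first match and the second zero is left alone (the result is 11101). A minimal sketch of a workaround, assuming stop only ever contains 0 and 1, is to use PCRE lookarounds so the flanking ones are checked but not consumed. The helper name fill_gaps_regex is made up for this illustration.

```r
# Hypothetical helper (not part of the original answer): fill gaps of one or
# two zeros that sit between ones, without consuming the flanking ones.
fill_gaps_regex <- function(x) {
  s <- paste0(x, collapse = "")
  # perl = TRUE enables lookbehind (?<=1) and lookahead (?=1)
  s <- gsub("(?<=1)0(?=1)", "1", s, perl = TRUE)    # single-zero gaps
  s <- gsub("(?<=1)00(?=1)", "11", s, perl = TRUE)  # double-zero gaps
  as.numeric(strsplit(s, "")[[1]])
}

# Plain pattern: the second zero is missed because matches cannot overlap
gsub("101", "111", "10101")
# [1] "11101"

# Lookaround version: both single-zero gaps are filled
fill_gaps_regex(c(1, 0, 1, 0, 1))
# [1] 1 1 1 1 1
```

On the example data in the question both versions give the same result; the difference only shows up when short gaps are separated by a single 1.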

Edit: grouping by name and location:

df <- read.table(text="name location    stop
John    London  1
John    London  0
John    London  1
John    New_York    0
John    New_York    1
John    New_York    0
John    New_York    0
John    New_York    0
John    New_York    1
John    New_York    0
",header  = TRUE, stringsAsFactors = TRUE)

# Fill gaps of one or two zeros flanked by ones within a single group
f <- function(x)
{
  stopString <- paste0(x, collapse = "")
  stopString <- gsub("101", "111", stopString)
  stopString <- gsub("1001", "1111", stopString)
  as.numeric(unlist(strsplit(stopString, "")))
}

> library(dplyr)
> df %>% group_by(name, location) %>%
  mutate(s = f(stop))
# A tibble: 10 x 4
# Groups:   name, location [2]
   name  location  stop     s
   <fct> <fct>    <int> <dbl>
 1 John  London       1     1
 2 John  London       0     1
 3 John  London       1     1
 4 John  New_York     0     0
 5 John  New_York     1     1
 6 John  New_York     0     0
 7 John  New_York     0     0
 8 John  New_York     0     0
 9 John  New_York     1     1
10 John  New_York     0     0
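If you would rather not go through strings at all, the same rule can be sketched in base R with run-length encoding. This is only an illustration under a few assumptions: stop contains only 0 and 1 with no NA values, rows within each group are already in the intended order, and "fewer than 3 zeros" means a run of at most 2. The names fill_short_gaps and max_gap are invented for this sketch.

```r
# Hypothetical helper: turn runs of at most max_gap zeros into ones when the
# run has a neighbouring run on both sides (which, for 0/1 data, is a run of 1s).
fill_short_gaps <- function(x, max_gap = 2) {
  r <- rle(x)                          # runs of identical values
  k <- seq_along(r$values)
  interior <- k > 1 & k < length(k)    # not the first or the last run
  r$values[r$values == 0 & r$lengths <= max_gap & interior] <- 1
  inverse.rle(r)                       # expand the runs back to a vector
}

# Same data as in the grouped example above
df <- data.frame(
  name = "John",
  location = rep(c("London", "New_York"), times = c(3, 7)),
  stop = c(1, 0, 1, 0, 1, 0, 0, 0, 1, 0)
)

# ave() applies the function within each (name, location) group
df$s <- ave(df$stop, df$name, df$location, FUN = fill_short_gaps)
df$s
# [1] 1 1 1 0 1 0 0 0 1 0
```

Because whole runs are examined rather than fixed patterns, this handles neighbouring short gaps such as 1 0 1 0 1 in one pass, and changing max_gap adjusts the threshold without writing more patterns.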
