
A regular expression to filter out repetitive numbers for R data frame columns

I have a data frame with many rows and columns, and I need to filter it based on the values of two columns (Lat and Lon). I need a regular expression that:

  1. Removes any row in which either the Lat or the Lon column has fewer than three decimal places. So the first row (human) would be filtered out: although Lon has three decimal places, Lat does not.
  2. Removes any row in which the decimal places are redundant. By redundant I mean that the same digit is repeated at least three times, continuing to the end of the number. If the repetition only starts at the third decimal place or later, it doesn't matter. And if the repetition is eventually followed by a different digit, it doesn't matter either.
Type <- c("human", "camera", "ebird", "museum", "specimen", "gbif")
Lat <- c(34.67, 34.66, 34.6666666, 34.666582, 34.56666, 34.586666)
Lon <- c(9.888, 9.88, 9.8761, 9.888064, 9.78888, 9.318888)
x <- data.frame(cbind(Type, Lat, Lon))
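
One detail that string-based checks rely on: calling `cbind()` on a character vector together with numeric vectors builds a character matrix, so Lat and Lon arrive in the data frame as text rather than numbers. A small sketch of how to confirm this, assuming R 4.0 or later (where `data.frame()` no longer converts strings to factors by default); the names `x.chr` and `x.num` are only illustrative:

```r
Type <- c("human", "camera")
Lat <- c(34.67, 34.6666666)
Lon <- c(9.888, 9.8761)

# cbind() coerces everything to character before data.frame() sees it
x.chr <- data.frame(cbind(Type, Lat, Lon))
# Passing the vectors directly keeps Lat and Lon numeric
x.num <- data.frame(Type, Lat, Lon)

print(class(x.chr$Lat))  # character columns (assumes R >= 4.0)
print(class(x.num$Lat))  # numeric columns
```

If the columns are numeric, they would need `as.character()` before any pattern matching, and trailing zeros are not preserved in that case (a value entered as 34.500 is stored and printed as 34.5).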

Here's how each row would fare under the regex:

  1. Fails because Lat has only two decimal places, even though Lon passes.
  2. Fails because both Lat and Lon have only two decimal places.
  3. Fails because Lat repeats the same digit, starting at the first decimal place, and the repetition continues to the end of the number.
  4. Passes the regex.
  5. Fails because the repeated digit starts at the second decimal place and continues for at least three repetitions all the way to the end.
  6. Passes the regex.

So the data frame resulting from this regex filter would be:

Type <- c("museum", "gbif")
Lat <- c(34.666582, 34.586666)
Lon <- c(9.888064, 9.318888)
x <- data.frame(cbind(Type, Lat, Lon))

The function below returns the data frame you want and covers all of the requirements stated above. It relies on the stringr package.

library(stringr)

check.expressions <- function(data){
  data$pass <- FALSE
  for(i in seq_len(nrow(data))){
    # Digits after the decimal point, as strings (NA if there is no decimal part)
    lon.dec <- str_extract(data$Lon[i], "(?<=\\.).*")
    lat.dec <- str_extract(data$Lat[i], "(?<=\\.).*")
    # Requirement 1: both coordinates need at least three decimal places
    if(is.na(lon.dec) || is.na(lat.dec) || nchar(lon.dec) < 3 || nchar(lat.dec) < 3){
      next
    }
    # Requirement 2: reject a coordinate whose decimals, from the first or
    # second place onward, are one digit repeated three or more times to the end
    redundant <- FALSE
    for(dec in c(lon.dec, lat.dec)){
      digits <- unlist(str_split(dec, ""))
      for(start in 1:2){
        run <- digits[start:length(digits)]
        if(length(run) >= 3 && all(run == run[1])){
          redundant <- TRUE
        }
      }
    }
    data$pass[i] <- !redundant
  }
  # Keep only the rows that passed both requirements
  data[data$pass, ]
}

The function call is just:

check.expressions(x) -> x.out

which would produce:

> check.expressions(x) -> x.out
> x.out
    Type       Lat      Lon pass
4 museum 34.666582 9.888064 TRUE
6   gbif 34.586666 9.318888 TRUE
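
Since the question asks for a regular expression, the same two rules can also be written as patterns and applied to whole columns without a loop. The following is a minimal sketch in base R, assuming Lat and Lon are stored as character strings (as they are after the `cbind()` call in the question). The helper names `enough.decimals`, `redundant.tail` and `keep.row` are hypothetical, and the second pattern encodes one reading of the rules: a run counts as redundant only if it starts at the first or second decimal place and continues to the end of the number.

```r
# Sample data from the question; cbind() turns Lat and Lon into character strings
Type <- c("human", "camera", "ebird", "museum", "specimen", "gbif")
Lat <- c(34.67, 34.66, 34.6666666, 34.666582, 34.56666, 34.586666)
Lon <- c(9.888, 9.88, 9.8761, 9.888064, 9.78888, 9.318888)
x <- data.frame(cbind(Type, Lat, Lon))

# Rule 1: a decimal point followed by at least three digits up to the end
enough.decimals <- function(v) grepl("\\.\\d{3,}$", v, perl = TRUE)

# Rule 2: after the decimal point and at most one other digit, a single digit
# repeated three or more times up to the end of the string
redundant.tail <- function(v) grepl("\\.\\d?(\\d)\\1{2,}$", v, perl = TRUE)

# A value is acceptable when it has enough decimals and no redundant tail
keep.row <- function(v) enough.decimals(v) & !redundant.tail(v)

x.out <- x[keep.row(x$Lat) & keep.row(x$Lon), ]
print(x.out)
```

On the sample data this keeps rows 4 and 6 (museum and gbif). Because `grepl()` is vectorized, no per-row loop or helper column is needed; `perl = TRUE` is used here so that `\d` and the backreference `\1` are handled by the PCRE engine.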
