
Filter the values in a variable in a dataframe which match a regular expression using grep in R

I have data which looks like this

data <- data.frame(
  ID_num = c("BGR9876", "BNG3421", "GTH4567", "YOP9824", "Child 1", "2JAZZ", "TYH7654"),
  date_created = "19/07/1983"
)

I would like to filter the dataframe so that I only keep the rows where ID_num follows the pattern ABC1234. I am new to using regular expressions in grep, and I am getting this wrong. This is what I am trying:

data_clean <- data %>%
  filter(grep("[A-Z]{3}[1:9]{4}", ID_num))

This gives me the error: Error in filter_impl(.data, quo) : Argument 2 filter condition does not evaluate to a logical vector

This is my desired output

data_clean <- data.frame(
  ID_num = c("BGR9876", "BNG3421", "GTH4567", "YOP9824", "TYH7654"),
  date_created = "19/07/1983"
)

Thanks

The 1:9 should be 1-9 (a range inside a bracket expression is written with a hyphen), and grep should be grepl. Add ^ to anchor the match at the start of the string and $ to anchor it at the end:

library(dplyr)
data %>%
   filter(grepl("^[A-Z]{3}[1-9]{4}$", ID_num))
#   ID_num date_created
#1 BGR9876   19/07/1983
#2 BNG3421   19/07/1983
#3 GTH4567   19/07/1983
#4 YOP9824   19/07/1983
#5 TYH7654   19/07/1983

filter expects a logical vector. grep returns the numeric indices of the matching elements, while grepl returns a logical vector.
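To make the difference concrete, here is a minimal base R sketch. The vector name ids and the variable names pattern, idx and keep are chosen for illustration only; the values are the ID_num column from the question.

```r
ids <- c("BGR9876", "BNG3421", "GTH4567", "YOP9824", "Child 1", "2JAZZ", "TYH7654")
pattern <- "^[A-Z]{3}[1-9]{4}$"

# grep returns the integer positions of the elements that match
idx <- grep(pattern, ids)
idx
# [1] 1 2 3 4 7

# grepl returns one TRUE/FALSE per element, which is what filter needs
keep <- grepl(pattern, ids)
keep
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE

# Either result selects the same elements when used for subsetting
identical(ids[idx], ids[keep])
# [1] TRUE
```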


Or, if we want to use grep, use slice, which expects numeric indices:

data %>%
   slice(grep("^[A-Z]{3}[1-9]{4}$", ID_num))

A similar option in the tidyverse is str_detect from stringr (note that the string comes first and the pattern second):

library(stringr)
data %>%
    filter(str_detect(ID_num, "^[A-Z]{3}[1-9]{4}$"))

In base R, we can use subset:

subset(data, grepl("^[A-Z]{3}[1-9]{4}$", ID_num))

Or with the Extract operator ([):

data[grepl("^[A-Z]{3}[1-9]{4}$", data$ID_num),]

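Note that the anchored pattern matches only strings that consist of exactly 3 upper-case letters followed by 4 digits in the range 1-9. Without ^ and $, the pattern also matches longer strings that merely contain such a sequence: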

grepl("[A-Z]{3}[1-9]{4}", "ABGR9876923")
#[1] TRUE

grepl("^[A-Z]{3}[1-9]{4}$", "ABGR9876923")
#[1] FALSE
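One more detail about the character class: [1-9] does not include 0. The sample data happens to contain no zeros, but if real IDs can, [0-9] may be the safer choice. A small sketch, where "ABC1024" is a made-up ID used only for illustration:

```r
# Hypothetical ID containing a zero (not part of the original data)
id_with_zero <- "ABC1024"

# [1-9] rejects the 0, so the ID does not match
grepl("^[A-Z]{3}[1-9]{4}$", id_with_zero)
# [1] FALSE

# [0-9] accepts every decimal digit
grepl("^[A-Z]{3}[0-9]{4}$", id_with_zero)
# [1] TRUE
```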

Another option is grepl with \\d, which matches any digit:

data[grepl("[A-Z]{3}\\d{4}", data$ID_num), ]

#   ID_num date_created
#1 BGR9876   19/07/1983
#2 BNG3421   19/07/1983
#3 GTH4567   19/07/1983
#4 YOP9824   19/07/1983
#7 TYH7654   19/07/1983

Or in filter

library(dplyr)
data %>% filter(grepl("[A-Z]{3}\\d{4}", ID_num))
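The \\d pattern above is not anchored, so it would also keep an ID that only contains three letters followed by four digits somewhere inside it. If that matters, the anchors can be combined with \\d. The sketch below uses base R only and adds one extra row, "ABGR9876923", purely as an assumed example of such an ID; the names loose and strict are illustrative.

```r
data <- data.frame(
  ID_num = c("BGR9876", "BNG3421", "GTH4567", "YOP9824",
             "Child 1", "2JAZZ", "TYH7654",
             "ABGR9876923"),  # extra row, assumed for illustration
  date_created = "19/07/1983"
)

# Unanchored: keeps any ID that contains 3 letters followed by 4 digits
loose <- data[grepl("[A-Z]{3}\\d{4}", data$ID_num), ]
nrow(loose)
# [1] 6

# Anchored: keeps only IDs that are exactly 3 letters followed by 4 digits
strict <- data[grepl("^[A-Z]{3}\\d{4}$", data$ID_num), ]
nrow(strict)
# [1] 5
```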
