Filter by string format in R

I have an ID column that should always be formatted as ABCDE123: five letters followed by three digits, with no gaps and no symbols.

I know for certain that a number of rows don't follow this format. Is it possible to filter by string format in R, so that I can identify those rows and review them?

Tidyverse is preferred, but any solution would be helpful!

If the IDs are 5 upper case letters followed by 3 digits, write a regex that matches 5 upper case letters ( [A-Z]{5} ) from the start ( ^ ) of the string, followed by 3 digits ( [0-9]{3} ) at the end ( $ ) of the string. Pass it to str_detect, which returns a logical vector that filter uses to keep the matching rows of the data:

library(dplyr)
library(stringr)
df1 %>%
    filter(str_detect(ID, '^[A-Z]{5}[0-9]{3}$'))

If the matching rows should be removed instead, leaving only the incorrectly formatted IDs to review, specify negate = TRUE in str_detect:

df1 %>%
    filter(str_detect(ID, '^[A-Z]{5}[0-9]{3}$', negate = TRUE))
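
As a minimal sketch of both calls on made-up data (the sample IDs and the names id_pattern, valid and to_review are assumptions for illustration, not from the question): note that str_detect returns NA for a missing ID and filter drops rows where the condition is NA, so missing IDs need an explicit is.na() check if they should also show up for review.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; df1 and ID mirror the names used above
df1 <- tibble(
  ID = c("ABCDE123", "abcde123", "ABCD1234", "ABCDE-123",
         "ABCDE 123", NA, "VWXYZ789")
)

# Five upper case letters followed by three digits, nothing else
id_pattern <- "^[A-Z]{5}[0-9]{3}$"

# Rows whose ID follows the expected format
valid <- df1 %>%
    filter(str_detect(ID, id_pattern))

# Rows to review: malformed IDs plus missing ones.
# str_detect() gives NA for NA input and filter() drops NA conditions,
# so the is.na() check keeps missing IDs in the review set.
to_review <- df1 %>%
    filter(is.na(ID) | str_detect(ID, id_pattern, negate = TRUE))

print(valid)
print(to_review)
```

Without the is.na(ID) term, the row with the missing ID would appear in neither result.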

Or, as @BenBolker mentioned in the comments, [[:upper:]]{5} would be more generic than [A-Z]{5}, since it is not limited to the ASCII letters A to Z.
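
To see where the two character classes differ, here is a small sketch with a made-up ID that starts with an accented capital letter (the ids vector and result names are assumptions for illustration). With stringr, the POSIX class should accept it while the A to Z range should not; whether accepting such letters is desirable depends on what the ID specification really allows.

```r
library(stringr)

# Hypothetical IDs; "\u00c9" is the accented capital letter "É"
ids <- c("ABCDE123", "\u00c9BCDE123")

# ASCII-only range: the accented letter falls outside A-Z
ascii_match <- str_detect(ids, "^[A-Z]{5}[0-9]{3}$")

# POSIX class: any character classed as an upper case letter
upper_match <- str_detect(ids, "^[[:upper:]]{5}[0-9]{3}$")

print(ascii_match)
print(upper_match)
```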
