Filter by string format in R
I have an ID column that should always be formatted as ABCDE123: five letters and three numbers, with no gaps and no symbols.
I know for sure there are a number of rows that don't correctly follow this format. Is it possible to filter by the string format in R, so that I can identify those rows and review them?
Tidyverse is preferred, but any solution would be helpful!
If these are 5 upper case letters followed by 3 digits, specify a regex that matches 5 upper case letters ( [A-Z]{5} ) from the start ( ^ ) of the string, followed by 3 digits ( [0-9]{3} ) at the end ( $ ) of the string. Pass it to str_detect to return a logical vector, which is then used in filter to keep the rows of the data that match.
library(dplyr)
library(stringr)

# keep only the rows whose ID matches the expected format
df1 %>%
  filter(str_detect(ID, '^[A-Z]{5}[0-9]{3}$'))
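To see what the pattern accepts and rejects, here is a minimal sketch on made-up data. The contents of df1 are hypothetical and only chosen to exercise the different ways an ID can be malformed.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; these IDs are invented for illustration
df1 <- data.frame(
  ID = c("ABCDE123", "abcde123", "ABCD1234", "ABCDE 123", "ABCDE12"),
  stringsAsFactors = FALSE
)

pattern <- "^[A-Z]{5}[0-9]{3}$"

# Logical vector: TRUE only where the whole string is 5 upper case letters + 3 digits
matches <- str_detect(df1$ID, pattern)

# Rows that follow the expected format
valid <- df1 %>%
  filter(str_detect(ID, pattern))
```

Only the first ID matches here: the others fail because of lower case letters, a wrong letter/digit split, an embedded space, or a missing digit.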
If the matching rows should be removed instead, leaving only the rows that do not follow the format, specify negate = TRUE in str_detect
# keep only the rows whose ID does NOT match the expected format
df1 %>%
  filter(str_detect(ID, '^[A-Z]{5}[0-9]{3}$', negate = TRUE))
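One thing to watch for when collecting the rows to review: str_detect returns NA for a missing ID, and filter drops rows where the condition is NA, so missing IDs would silently disappear from the result. The sketch below uses hypothetical data and assumes a missing ID should also be flagged for review; it also assumes a stringr version that supports the negate argument (1.4.0 or later).

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, including a missing ID
df1 <- data.frame(
  ID = c("ABCDE123", "abcde123", "ABCDE12", NA),
  stringsAsFactors = FALSE
)

pattern <- "^[A-Z]{5}[0-9]{3}$"

# IDs that are present but malformed; the NA row is dropped by filter()
malformed <- df1 %>%
  filter(str_detect(ID, pattern, negate = TRUE))

# Rows to review, treating a missing ID as a problem as well
to_review <- df1 %>%
  filter(is.na(ID) | str_detect(ID, pattern, negate = TRUE))
```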
Or, as @BenBolker mentioned in the comments, [[:upper:]]{5} would be more generic than [A-Z]{5}
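As a rough illustration of the difference, the sketch below compares the two character classes on hypothetical IDs, one of which starts with a non-ASCII upper case letter. It assumes stringr's default regex engine (ICU), where [[:upper:]] is expected to cover upper case letters beyond A-Z; whether such IDs should count as valid depends on your data.

```r
library(stringr)

# Hypothetical IDs; the second starts with "\u00C9" (E with acute accent)
ids <- c("ABCDE123", "\u00C9BCDE123", "abcde123")

# ASCII-only upper case letters
ascii_only <- str_detect(ids, "^[A-Z]{5}[0-9]{3}$")

# Any upper case letter recognised by the regex engine
any_upper <- str_detect(ids, "^[[:upper:]]{5}[0-9]{3}$")

data.frame(ids, ascii_only, any_upper)
```

If the IDs are strictly ASCII, [A-Z] is the stricter check and will flag accented letters as errors.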