Filter by string format in R

Question

I have an ID column that should always be formatted like ABCDE123: five letters followed by three digits, with no spaces or symbols.

I know for sure that a number of rows don't follow this format. Is it possible to filter by string format in R, so that I can identify those rows and review them?

Tidyverse is preferred, but any solution would be helpful!

Answer 1

If the IDs are 5 upper-case letters followed by 3 digits, specify a regex that matches 5 upper-case letters ([A-Z]{5}) from the start (^) of the string, followed by 3 digits ([0-9]{3}) at the end ($) of the string. Passing it to str_detect returns a logical vector, which filter then uses to select the rows of the data. This keeps the rows whose ID is correctly formatted:

library(dplyr)
library(stringr)
df1 %>%
    filter(str_detect(ID, '^[A-Z]{5}[0-9]{3}$'))
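
To see the filter in action, here is a minimal sketch on a made-up data frame. The name df1 comes from the answer above, but the ID values and the value column are invented purely for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data: only the first and last IDs follow the ABCDE123 format
df1 <- tibble(
  ID    = c("ABCDE123", "ABCD1234", "abcde123", "ABCDE-12", "VWXYZ999"),
  value = 1:5
)

# Keep only rows whose ID is exactly five upper-case letters followed by three digits
valid <- df1 %>%
  filter(str_detect(ID, "^[A-Z]{5}[0-9]{3}$"))

valid$ID
```

The ^ and $ anchors matter here: without them, a longer string that merely contains a valid-looking ID somewhere inside it would also pass.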

If the correctly formatted rows should be removed instead, leaving only the rows that do not match (the ones to review), specify negate = TRUE in str_detect:

df1 %>%
    filter(str_detect(ID, '^[A-Z]{5}[0-9]{3}$', negate = TRUE))
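
Since the goal in the question is to review the malformed rows, the negated filter is probably the one to run. One thing to watch, as far as I know: str_detect returns NA for a missing ID, and filter drops rows where the condition is NA, so missing IDs would not show up in either result. A sketch that flags them too, again on invented data (the names pattern, malformed and to_review are just illustrative):

```r
library(dplyr)
library(stringr)

# Hypothetical sample data with one missing ID
df1 <- tibble(ID = c("ABCDE123", "ABCD1234", NA, "abcde123"))

pattern <- "^[A-Z]{5}[0-9]{3}$"

# Rows with a present but malformed ID; the NA row is silently dropped here
malformed <- df1 %>%
  filter(str_detect(ID, pattern, negate = TRUE))

# Rows to review, including missing IDs
to_review <- df1 %>%
  filter(is.na(ID) | str_detect(ID, pattern, negate = TRUE))
```

Whether a missing ID counts as "incorrectly formatted" depends on the data, so treat the is.na() check as optional.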

Or, as @BenBolker mentioned in the comments, [[:upper:]]{5} would be more generic than [A-Z]{5}.
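
The difference only shows up with letters outside the ASCII range. A small sketch, assuming an accented ID could occur in the data (the example values and variable names are invented):

```r
library(stringr)

# Second ID starts with an accented capital E (U+00C9), written as an escape
ids <- c("ABCDE123", "\u00c9BCDE123")

# [A-Z] only covers the 26 ASCII capitals
ascii_only <- str_detect(ids, "^[A-Z]{5}[0-9]{3}$")

# [[:upper:]] also accepts upper-case letters such as the accented E
any_upper <- str_detect(ids, "^[[:upper:]]{5}[0-9]{3}$")
```

If the IDs are meant to be strictly ASCII, [A-Z] is arguably the stricter and safer check; [[:upper:]] is the better fit when non-English capitals are legitimate.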
