grep pattern for column names
I have a data frame and want to grep column names that fit particular patterns. I have four sets of patterns:
# set 1 (underscore, no A)
cat1_1
cat12_12
# set 2 (underscore, A)
cat4_4A
cat18_18A
# set 3 (no underscore, no p)
dog2
dog12
# set 4 (no underscore, p)
dog2p
dog12p
My actual data frame contains a different number of columns per set, but for simplicity I am showing just two columns per set in this example.
ex <- data.frame(cat1_1=c("1a", "1a"),
cat12_12=c("1b", "1b"),
cat4_4A=c("2a", "2a"),
cat18_18A=c("2b", "2b"),
dog2=c("3a", "3a"),
dog12=c("3b", "3b"),
dog2p=c("4a", "4a"),
dog12p=c("4b", "4c"))
ex
# cat1_1 cat12_12 cat4_4A cat18_18A dog2 dog12 dog2p dog12p
#1 1a 1b 2a 2b 3a 3b 4a 4b
#2 1a 1b 2a 2b 3a 3b 4a 4c
I want to grep names(ex) so that I grab all set 1 variables, then separately all set 2 variables, and so on. For instance, grep(PATTERN, names(ex), value=TRUE) for set 1 should return:
[1] "cat1_1" "cat12_12"
I'd appreciate help with the grep pattern for each set. One constraint is that I do not want to change any column names.
Based on the example shown by the OP, if we need to find column names that start (^) with 'cat', followed by one or more digits (\\d+), followed by an underscore (\\_; the escape is optional, since an underscore is not a regex metacharacter), followed by one or more digits (\\d+) up to the end of the string ($), we get 'cat1_1' and 'cat12_12'.
grep('^cat\\d+\\_\\d+$', names(ex), value=TRUE)
Similar logic can be used for the other cases (sets 2, 3, and 4, in that order).
grep('^cat\\d+\\_\\d+[A-Z]+$', names(ex), value= TRUE)
grep('^dog\\d+$', names(ex), value=TRUE)
grep('^dog\\d+[a-z]+$', names(ex), value=TRUE)
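As a sketch of how the four calls could be kept in one place, the patterns can be stored in a named list and applied with lapply. The names patterns, cols_by_set, and the labels set1 to set4 are made up for illustration, and ex is rebuilt here only so that the snippet runs on its own.

```r
# Rebuild the example data frame from the question.
ex <- data.frame(cat1_1 = c("1a", "1a"), cat12_12 = c("1b", "1b"),
                 cat4_4A = c("2a", "2a"), cat18_18A = c("2b", "2b"),
                 dog2 = c("3a", "3a"), dog12 = c("3b", "3b"),
                 dog2p = c("4a", "4a"), dog12p = c("4b", "4c"))

# Hypothetical labels; the underscore is written without the optional escape.
patterns <- list(set1 = "^cat\\d+_\\d+$",
                 set2 = "^cat\\d+_\\d+[A-Z]+$",
                 set3 = "^dog\\d+$",
                 set4 = "^dog\\d+[a-z]+$")

# One character vector of matching column names per set.
cols_by_set <- lapply(patterns, grep, x = names(ex), value = TRUE)

# Subset the data frame by set; no column is renamed.
ex[cols_by_set$set1]
```

Keeping the patterns in a list means that a fifth set would only need one more list entry rather than another grep call.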
Another option would be to split the column names using a grouping variable derived from names(ex). Here gsub replaces every run of digits with '1', so that all columns in the same set end up with the same key:
split(names(ex), gsub('\\d+(?=\\_)|(?<=\\_)\\d+|(?<=[a-z])\\d+',
'1', names(ex), perl=TRUE))
#$cat1_1
#[1] "cat1_1" "cat12_12"
#$cat1_1A
#[1] "cat4_4A" "cat18_18A"
#$dog1
#[1] "dog2" "dog12"
#$dog1p
#[1] "dog2p" "dog12p"
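If the goal is the columns themselves rather than just their names, the result of split can be used to index the data frame. The following is a minimal sketch along those lines; the names groups and subsets are hypothetical, and ex is rebuilt so the snippet is self-contained.

```r
# Rebuild the example data frame from the question.
ex <- data.frame(cat1_1 = c("1a", "1a"), cat12_12 = c("1b", "1b"),
                 cat4_4A = c("2a", "2a"), cat18_18A = c("2b", "2b"),
                 dog2 = c("3a", "3a"), dog12 = c("3b", "3b"),
                 dog2p = c("4a", "4a"), dog12p = c("4b", "4c"))

# Same grouping as above: every digit run becomes "1".
groups <- split(names(ex), gsub('\\d+(?=\\_)|(?<=\\_)\\d+|(?<=[a-z])\\d+',
                                '1', names(ex), perl = TRUE))

# One sub data frame per set, each keeping its original column names.
subsets <- lapply(groups, function(cols) ex[cols])
subsets$dog1p
```

Each element of subsets is an ordinary data frame, so the original ex is left untouched.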
As an addition to the good answers so far, R supports some "special" strings (POSIX character classes) that might ease the transition into using regular expressions. For example, [:digit:] matches any digit and [:alpha:] matches any alphabetic character.
If we apply this to the four column name types that you are working with, we get the following:
grep("^cat[[:digit:]]+_[[:digit:]]+$", names(ex), value=TRUE)
# "cat1_1" "cat12_12"
grep("^cat[[:digit:]]+_[[:digit:]]+A$", names(ex), value=TRUE)
# "cat4_4A" "cat18_18A"
grep("^dog[[:digit:]]+$", names(ex), value=TRUE)
# "dog2" "dog12"
grep("^dog[[:digit:]]+p$", names(ex), value=TRUE)
# "dog2p" "dog12p"
Note that we have to enclose [:digit:] in another set of square brackets, because the class name is only recognized inside a bracket expression. At the very least, I think it's a bit more readable to a newcomer than double-escaped sequences such as \\d (although at some point you'll get tired of typing the extra characters :D).
For a complete list of these "special" strings and other useful information about regular expressions in R, I would recommend checking out the regular expressions help page in the R base documentation.
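To make the comparison with \\d concrete, here is a small sketch that checks whether the different spellings of "one or more digits" select the same names. The vector cols simply repeats the column names from the question, and the variable names are made up for illustration.

```r
# Column names from the question, repeated so the snippet runs on its own.
cols <- c("cat1_1", "cat12_12", "cat4_4A", "cat18_18A",
          "dog2", "dog12", "dog2p", "dog12p")

# Three spellings of "one or more digits" for set 4.
posix    <- grep("^dog[[:digit:]]+p$", cols, value = TRUE)
escaped  <- grep("^dog\\d+p$", cols, value = TRUE)
explicit <- grep("^dog[0-9]+p$", cols, value = TRUE)

# TRUE when all three spellings select the same names.
identical(posix, escaped) && identical(posix, explicit)

# [:alpha:] in the same style: letters followed only by digits (set 3).
grep("^[[:alpha:]]+[[:digit:]]+$", cols, value = TRUE)
```

Which spelling to use is mostly a matter of taste; the bracketed classes are longer to type but avoid the doubled backslash.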