Grep pattern for column names

Question

I have a data frame and want to grep the column names that fit particular patterns. I have four sets of patterns:

# set 1 (underscore, no A)
  cat1_1
  cat12_12

# set 2 (underscore, A)
  cat4_4A
  cat18_18A

# set 3 (no underscore, no p)
  dog2
  dog12

# set 4 (no underscore, p)
  dog2p
  dog12p

My actual data frame contains a different number of columns per set, but for simplicity I am showing just two columns per set in this example.

ex <- data.frame(cat1_1=c("1a", "1a"),
                 cat12_12=c("1b", "1b"),
                 cat4_4A=c("2a", "2a"),
                 cat18_18A=c("2b", "2b"),
                 dog2=c("3a", "3a"),
                 dog12=c("3b", "3b"),
                 dog2p=c("4a", "4a"),
                 dog12p=c("4b", "4c"))
ex
#   cat1_1 cat12_12 cat4_4A cat18_18A dog2 dog12 dog2p dog12p
# 1     1a       1b      2a        2b   3a    3b    4a     4b
# 2     1a       1b      2a        2b   3a    3b    4a     4c

I want to grep names(ex) so that I grab all set 1 variables, then separately all set 2 variables, and so on. For instance, grep(PATTERN, names(ex), value=TRUE) for set 1 should return:

[1] "cat1_1" "cat12_12"

I'd appreciate help with the grep pattern for each set. One constraint is that I do not want to change any column names.

Answer 1

Based on the example shown by the OP, to find column names that start ( ^ ) with 'cat', followed by one or more digits ( \\d+ ), an underscore ( \\_ ), and one or more digits ( \\d+ ) up to the end of the string ( $ ), we can use the pattern below, which returns 'cat1_1' and 'cat12_12'.

 grep('^cat\\d+\\_\\d+$', names(ex), value=TRUE)

Similar logic can be used for the other cases.

 grep('^cat\\d+\\_\\d+[A-Z]+$', names(ex), value= TRUE)
 grep('^dog\\d+$', names(ex), value=TRUE)
 grep('^dog\\d+[a-z]+$', names(ex), value=TRUE)
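
As a minimal sketch of how these four patterns might be used together, the block below keeps them in a named vector and applies grep to each one. The labels set1 to set4 are hypothetical, the data frame is a copy of the one in the question, and the underscore is left unescaped on the assumption that it needs no escaping because it is not a regex metacharacter.

```r
# Copy of the question's data frame so the example runs on its own.
ex <- data.frame(cat1_1 = c("1a", "1a"), cat12_12 = c("1b", "1b"),
                 cat4_4A = c("2a", "2a"), cat18_18A = c("2b", "2b"),
                 dog2 = c("3a", "3a"), dog12 = c("3b", "3b"),
                 dog2p = c("4a", "4a"), dog12p = c("4b", "4c"))

# Hypothetical labels; one pattern per set of column names.
patterns <- c(set1 = "^cat\\d+_\\d+$",
              set2 = "^cat\\d+_\\d+[A-Z]+$",
              set3 = "^dog\\d+$",
              set4 = "^dog\\d+[a-z]+$")

# Each pattern is passed to grep() as its first argument (the pattern);
# the result is a named list of matching column names.
cols <- lapply(patterns, grep, x = names(ex), value = TRUE)

# Select one set of columns without renaming anything.
set2_df <- ex[, cols$set2, drop = FALSE]
```

Here cols$set1 holds the set 1 names, and set2_df keeps only the set 2 columns under their original names.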

Another option is to split the column names by a grouping variable derived from names(ex), built by replacing each run of digits with 1:

 split(names(ex), gsub('\\d+(?=\\_)|(?<=\\_)\\d+|(?<=[a-z])\\d+', 
                             '1', names(ex), perl=TRUE))
 #$cat1_1
 #[1] "cat1_1"   "cat12_12"

 #$cat1_1A
 #[1] "cat4_4A"   "cat18_18A"

 #$dog1
 #[1] "dog2"  "dog12"

 #$dog1p
 #[1] "dog2p"  "dog12p"
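
The grouping idea can also be sketched with a simpler key. The version below is not the answer's lookaround pattern: it collapses every run of digits to 1 with a plain \\d+, which happens to give the same keys for these particular names. The vector nms is a stand-in for names(ex).

```r
# Stand-in for names(ex) from the question.
nms <- c("cat1_1", "cat12_12", "cat4_4A", "cat18_18A",
         "dog2", "dog12", "dog2p", "dog12p")

# Collapse each run of digits to "1" so names in the same set share a key.
key <- gsub("\\d+", "1", nms)

# Group the original names by that key.
groups <- split(nms, key)

# Pull out one group by its key.
groups[["cat1_1A"]]
```

With the real data frame, ex[groups[["cat1_1A"]]] would then select the set 2 columns. The simpler key would merge sets that differ only in where their digits sit, so the lookaround version is the safer choice when names are less regular.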

Answer 2

Consider anchoring the regex with ^ (begins with) and $ (ends with):

names(ex)[grep("^cat.*[0-9]$", names(ex))]

names(ex)[grep("^cat.*A$", names(ex))]

names(ex)[grep("^dog.*[0-9]$", names(ex))]

names(ex)[grep("^dog.*p$", names(ex))]
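
These patterns are looser than ones that spell out the digits and the underscore, because .* accepts anything between the prefix and the final character. The sketch below illustrates the difference; "category9" and "dogs_2p" are hypothetical extra column names that do not appear in the question.

```r
# Names from the question plus two hypothetical extras.
nms <- c("cat1_1", "cat12_12", "cat4_4A", "cat18_18A",
         "dog2", "dog12", "dog2p", "dog12p",
         "category9", "dogs_2p")

# Loose: starts with "cat", ends with a digit, anything in between.
loose <- grep("^cat.*[0-9]$", nms, value = TRUE)

# Strict: "cat", digits, underscore, digits, and nothing else.
strict <- grep("^cat[0-9]+_[0-9]+$", nms, value = TRUE)

# Names accepted only by the loose pattern.
setdiff(loose, strict)
```

On the question's own column names both versions return the same result, so the looser form is fine as long as no other columns share the prefix.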

Answer 3

As an addition to the good answers so far, R has some "special" strings (POSIX character classes) that might ease the transition into using regular expressions. For example, [:digit:] matches any digit and [:alpha:] matches any alphabetic character.

If we apply this to the four column name types that you are working with, we get the following:

grep("^cat[[:digit:]]+_[[:digit:]]+$", names(ex), value=TRUE)
# "cat1_1"   "cat12_12"

grep("^cat[[:digit:]]+_[[:digit:]]+A$", names(ex), value=TRUE)
# "cat4_4A"   "cat18_18A"

grep("^dog[[:digit:]]+$", names(ex), value=TRUE)
# "dog2"  "dog12"

grep("^dog[[:digit:]]+p$", names(ex), value=TRUE)
# "dog2p"  "dog12p"

Note that we have to enclose [:digit:] in another set of square brackets, because a character class is only recognized inside a bracket expression. At the very least, I think it's a bit more readable to a newcomer than double-escaped sequences such as \\d (although at some point you'll get tired of typing the extra characters :D).
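
As a quick check of that readability trade-off, the sketch below compares three spellings of the set 1 pattern. It assumes plain ASCII column names, and nms is a stand-in for names(ex).

```r
# Stand-in for names(ex) from the question.
nms <- c("cat1_1", "cat12_12", "cat4_4A", "cat18_18A",
         "dog2", "dog12", "dog2p", "dog12p")

# Three ways to write "one or more digits" in the set 1 pattern.
posix_class <- grep("^cat[[:digit:]]+_[[:digit:]]+$", nms, value = TRUE)
escaped     <- grep("^cat\\d+_\\d+$", nms, value = TRUE)
range_form  <- grep("^cat[0-9]+_[0-9]+$", nms, value = TRUE)

# All three select the same columns for these names.
same <- identical(posix_class, escaped) && identical(posix_class, range_form)
same
```

Which spelling to use is mostly a matter of taste here; the POSIX class is longer to type but avoids the doubled backslash.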

For a complete list of these "special" strings and other useful information about regular expressions in R, I would recommend checking out this link from the R base documentation.
