
grep pattern for column names

I have a data frame and want to grep the column names that fit particular patterns. I have four sets of patterns:

# set 1 (underscore, no A)
  cat1_1
  cat12_12

# set 2 (underscore, A)
  cat4_4A
  cat18_18A

# set 3 (no underscore, no p)
  dog2
  dog12

# set 4 (no underscore, p)
  dog2p
  dog12p

My actual data frame contains a different number of columns per set, but I am showing just two columns per set in this example for simplicity.

ex <- data.frame(cat1_1=c("1a", "1a"),
                 cat12_12=c("1b", "1b"),
                 cat4_4A=c("2a", "2a"),
                 cat18_18A=c("2b", "2b"),
                 dog2=c("3a", "3a"),
                 dog12=c("3b", "3b"),
                 dog2p=c("4a", "4a"),
                 dog12p=c("4b", "4c"))
ex
#      cat1_1 cat12_12 cat4_4A cat18_18A dog2 dog12 dog2p dog12p
#1     1a       1b      2a        2b   3a    3b    4a     4b
#2     1a       1b      2a        2b   3a    3b    4a     4c

I want to grep names(ex) so that I grab all set 1 variables, then, separately, all set 2 variables, and so on. For instance, grep(PATTERN, names(ex), value=TRUE) for set 1 should return:

[1] "cat1_1" "cat12_12"

I'd appreciate help with the grep pattern for each set. One constraint is that I do not want to change any column names.

Based on the example shown by the OP, to find column names that start ( ^ ) with 'cat', followed by one or more digits ( \\d+ ), followed by an underscore ( \\_ ; the escape is optional, since _ is not a regex metacharacter), followed by one or more digits ( \\d+ ) up to the end of the string ( $ ), we can use the pattern below, which returns 'cat1_1' and 'cat12_12'.

 grep('^cat\\d+\\_\\d+$', names(ex), value=TRUE)

Similar logic can be used for the other cases.

 grep('^cat\\d+\\_\\d+[A-Z]+$', names(ex), value= TRUE)
 grep('^dog\\d+$', names(ex), value=TRUE)
 grep('^dog\\d+[a-z]+$', names(ex), value=TRUE)
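To check that the four patterns above assign every column to exactly one set, they can be collected in a named vector and applied in one pass. This is only a sketch: the names nms, patterns and sets, and the set labels, are illustrative assumptions rather than part of the original answer, and the optional escape before the underscore is dropped.

```r
# Column names from the example data frame in the question
nms <- c("cat1_1", "cat12_12", "cat4_4A", "cat18_18A",
         "dog2", "dog12", "dog2p", "dog12p")

# The four patterns from this answer, labelled by set (labels are illustrative)
patterns <- c(set1 = "^cat\\d+_\\d+$",
              set2 = "^cat\\d+_\\d+[A-Z]+$",
              set3 = "^dog\\d+$",
              set4 = "^dog\\d+[a-z]+$")

# One character vector of matching names per set
sets <- lapply(patterns, grep, x = nms, value = TRUE)

# Every name should land in exactly one set
stopifnot(setequal(unlist(sets), nms), !anyDuplicated(unlist(sets)))
sets
```

If a column were matched by no pattern, or by two, the stopifnot() call would raise an error, which makes this a cheap sanity check before using the sets.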

Another option is to split the column names using a grouping variable derived from names(ex), here by collapsing each run of digits to '1':

 split(names(ex), gsub('\\d+(?=\\_)|(?<=\\_)\\d+|(?<=[a-z])\\d+', 
                             '1', names(ex), perl=TRUE))
 #$cat1_1
 #[1] "cat1_1"   "cat12_12"

 #$cat1_1A
 #[1] "cat4_4A"   "cat18_18A"

 #$dog1
 #[1] "dog2"  "dog12"

 #$dog1p
 #[1] "dog2p"  "dog12p"
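The grouped names can then be used to pull out one sub-data-frame per set without renaming anything. A minimal sketch, assuming the ex data frame from the question; the names key, groups and subsets are illustrative, and the underscore is left unescaped:

```r
# Example data frame from the question
ex <- data.frame(cat1_1 = c("1a", "1a"), cat12_12 = c("1b", "1b"),
                 cat4_4A = c("2a", "2a"), cat18_18A = c("2b", "2b"),
                 dog2 = c("3a", "3a"), dog12 = c("3b", "3b"),
                 dog2p = c("4a", "4a"), dog12p = c("4b", "4c"))

# Grouping key: each run of digits collapsed to "1", as in the answer above
key <- gsub("\\d+(?=_)|(?<=_)\\d+|(?<=[a-z])\\d+", "1", names(ex), perl = TRUE)
groups <- split(names(ex), key)

# One sub-data-frame per set; the original column names are kept
subsets <- lapply(groups, function(cols) ex[cols])
names(subsets[["dog1p"]])
```

For these particular column names, the simpler key gsub("\\d+", "1", names(ex)) gives the same groups, since every digit run is collapsed either way.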

Consider anchoring the pattern with ^ (start of string) and $ (end of string):

names(ex)[grep("^cat.*[0-9]$", names(ex))]

names(ex)[grep("^cat.*A$", names(ex))]

names(ex)[grep("^dog.*[0-9]$", names(ex))]

names(ex)[grep("^dog.*p$", names(ex))]
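These patterns are deliberately loose: .* accepts anything between the prefix and the final character. That is fine for the columns in the question, but it may matter if the real data frame has other names with the same prefix. The sketch below uses hypothetical column names (the last three are not in the question) and one possible stricter pattern for comparison:

```r
# Hypothetical names: "category9", "cat_x1" and "dog_old2" are NOT from the question
nms <- c("cat1_1", "cat12_12", "cat4_4A", "category9", "cat_x1", "dog_old2")

# Loose pattern from this answer vs. a stricter digits_digits pattern
loose  <- grep("^cat.*[0-9]$", nms, value = TRUE)
strict <- grep("^cat[0-9]+_[0-9]+$", nms, value = TRUE)

# Names that only the loose pattern accepts
setdiff(loose, strict)
```

Which version to prefer depends on how uniform the real column names are; the loose form is shorter, the strict form fails safer.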

As an addition to the good answers so far, R supports some "special" strings (POSIX character classes) that might ease the transition into using regular expressions. For example, [:digit:] matches any digit and [:alpha:] matches any alphabetic character.

If we apply this to the four column name types that you are working with, we get the following:

grep("^cat[[:digit:]]+_[[:digit:]]+$", names(ex), value=TRUE)
# "cat1_1"   "cat12_12"

grep("^cat[[:digit:]]+_[[:digit:]]+A$", names(ex), value=TRUE)
# "cat4_4A"   "cat18_18A"

grep("^dog[[:digit:]]+$", names(ex), value=TRUE)
# "dog2"  "dog12"

grep("^dog[[:digit:]]+p$", names(ex), value=TRUE)
# "dog2p"  "dog12p"

Note that we have to enclose [:digit:] in another set of square brackets, because the class name is only valid inside a bracket expression. At the very least, I think it's a bit more readable to a newcomer than double-escaped sequences such as \\d (although at some point you'll get tired of typing the extra characters :D).
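To confirm that the two notations are interchangeable for these column names, the \\d patterns can be rewritten mechanically and the results compared. This is a sketch under the assumption that the names are the eight columns from the question; nms, shorthand, posix and same are illustrative names.

```r
# Column names from the example data frame in the question
nms <- c("cat1_1", "cat12_12", "cat4_4A", "cat18_18A",
         "dog2", "dog12", "dog2p", "dog12p")

# Patterns written with the \\d shorthand
shorthand <- c("^cat\\d+_\\d+$", "^cat\\d+_\\d+A$", "^dog\\d+$", "^dog\\d+p$")

# Same patterns with each \d swapped for the POSIX class (literal replacement)
posix <- gsub("\\d", "[[:digit:]]", shorthand, fixed = TRUE)

# Compare the matches pattern by pattern
same <- mapply(function(a, b) {
  identical(grep(a, nms, value = TRUE), grep(b, nms, value = TRUE))
}, shorthand, posix)
all(same)
```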

For a complete list of these "special" strings and other useful information about regular expressions in R, I would recommend checking out the regular expression help page ( ?regex ) in the R base documentation.
