How to drop columns by name pattern in R?
I have this data frame:
state county city region mmatrix X1 X2 X3 A1 A2 A3 B1 B2 B3 C1 C2 C3
1 1 1 1 111010 1 0 0 2 20 200 Push 8 12 NA NA NA
1 2 1 1 111010 1 0 0 4 NA 400 Shove 9 NA
Now I want to exclude columns whose names end with a certain string, say "1" (e.g. A1 and B1). I wrote this code:
df_redacted <- df[, -grep("\\1$", colnames(df))]
However, this seems to delete every column. How can I modify the code so that it only deletes the columns that match the pattern (i.e. columns whose names end with "3" or any other string)?
The solution has to be able to handle a data frame that has both numerical and categorical values.
I found a simple answer using dplyr / tidyverse. If your colnames contain "This", then all variables containing "This" will be dropped.
library(dplyr)
df_new <- df %>% select(-contains("This"))
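Note that contains() matches the string anywhere in the name, while the question asks about names that end with a string. A minimal sketch of the suffix case, assuming dplyr is installed and using a made-up data frame (the column names here are illustrative, not from the question): ends_with() matches a literal suffix and matches() takes a regular expression.

```r
library(dplyr)

# Made-up example data; A10 contains "1" but does not end with it
df <- data.frame(ID  = 1:3,
                 A1  = c(0.1, 0.2, 0.3),
                 A10 = c(1.5, 2.5, 3.5),
                 B1  = c("a", "b", "c"),
                 B2  = c("k", "l", "m"))

# Drop columns whose names end with the literal string "1"
df_suffix <- df %>% select(-ends_with("1"))

# Same result with a regular expression anchored at the end of the name
df_regex <- df %>% select(-matches("1$"))

# contains() is broader: it also drops A10
df_contains <- df %>% select(-contains("1"))

names(df_suffix)    # "ID" "A10" "B2"
names(df_contains)  # "ID" "B2"
```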
Your code works like a charm if I apply it to a minimal example and just search for the string "A":
df <- data.frame(ID = 1:10,
A1 = rnorm(10),
A2 = rnorm(10),
B1 = letters[1:10],
B2 = letters[11:20])
df[, -grep("A", colnames(df))]
So your problem is more a regular expression problem than a question of how to drop columns. If I run your code, I get an error:
df[, -grep("\\3$", colnames(df))]
Error in grep("\\3$", colnames(df)) :
invalid regular expression '\3$', reason 'Invalid back reference'
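The error message points at the cause: the R string "\\3" is the regex \3, a back reference to a third capture group, and the pattern defines no groups. A digit is an ordinary character in a regex and needs no escaping. A small sketch with made-up column names:

```r
cols <- c("ID", "A1", "A2", "A3", "B13", "B3")

# "3$" means: the name ends with the character 3
ends_with_3 <- grepl("3$", cols)
ends_with_3  # FALSE FALSE FALSE TRUE TRUE TRUE

# A back reference is only valid after a capture group. Here "(.)\\1$"
# matches names whose last two characters are identical.
doubled <- grepl("(.)\\1$", c("AA", "A1", "B22"))
doubled  # TRUE FALSE TRUE
```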
Update: Why don't you just use the following expression?
df[, -grep("1$", colnames(df))]
ID A2 B2
1 1 2.0957940 k
2 2 -1.7177042 l
3 3 -0.0448357 m
4 4 1.2899925 n
5 5 0.7569659 o
6 6 -0.5048024 p
7 7 0.6929080 q
8 8 -0.5116399 r
9 9 -1.2621066 s
10 10 0.7664955 t
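One caveat with negative indexing: if grep() finds no match it returns integer(0), and df[, -integer(0)] selects zero columns, so every column is dropped. A sketch of a more defensive variant using a logical index from grepl(), on a small made-up data frame:

```r
df <- data.frame(ID = 1:3,
                 A1 = c(0.5, 1.5, 2.5),
                 A2 = c(2, 4, 6),
                 B1 = c("a", "b", "c"),
                 B2 = c("k", "l", "m"))

# Logical index: TRUE for the columns to keep
keep <- !grepl("1$", colnames(df))
df_redacted <- df[, keep, drop = FALSE]
colnames(df_redacted)  # "ID" "A2" "B2"

# No column ends with "9": negative indexing drops everything ...
no_match_neg <- df[, -grep("9$", colnames(df)), drop = FALSE]
ncol(no_match_neg)  # 0

# ... while the logical index keeps everything
no_match_lgl <- df[, !grepl("9$", colnames(df)), drop = FALSE]
ncol(no_match_lgl)  # 5
```

drop = FALSE keeps the result a data frame even when only one column remains.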
Just as an additional answer, since I stumbled across this when looking for the data.table solution to this problem.
library(data.table)
dt <- data.table(df)
drop.cols <- grep("1$", colnames(dt))
dt[, (drop.cols) := NULL]
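A variant of the same idea, sketched under the assumption that data.table is installed: grep(..., value = TRUE) returns the column names rather than positions, and the length check skips the deletion when nothing matches. Keep in mind that := modifies dt by reference. The data frame is a made-up example.

```r
library(data.table)

df <- data.frame(ID = 1:3,
                 A1 = c(0.1, 0.2, 0.3),
                 A2 = c(1.5, 2.5, 3.5),
                 B1 = c("a", "b", "c"),
                 B2 = c("k", "l", "m"))

# as.data.table() copies the data, so df itself is left untouched
dt <- as.data.table(df)

# Names (not positions) of the columns to drop
drop_cols <- grep("1$", names(dt), value = TRUE)

# Only delete when at least one column matched
if (length(drop_cols) > 0) {
  dt[, (drop_cols) := NULL]
}

names(dt)  # "ID" "A2" "B2"
```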
To exclude any string, you can use:
# Search string to exclude
strng <- "1"
df <- data.frame(matrix(runif(25,max=10),nrow=5))
colnames(df) <- paste( "EX" , 1:5 )
df_red <- df[, -( grep(paste0( strng , "$" ) , colnames(df),perl = TRUE) ) ]
df
# EX 1 EX 2 EX 3 EX 4 EX 5
# 1 7.332913 4.972780 1.175947853 6.428073 8.625763
# 2 2.730271 3.734072 6.031157537 1.305951 8.012606
# 3 9.450122 3.259247 2.856123205 5.067294 7.027795
# 4 9.682430 5.295177 0.002015966 9.322912 7.424568
# 5 1.225359 1.577659 4.013616377 5.092042 5.130887
df_red
# EX 2 EX 3 EX 4 EX 5
# 1 4.972780 1.175947853 6.428073 8.625763
# 2 3.734072 6.031157537 1.305951 8.012606
# 3 3.259247 2.856123205 5.067294 7.027795
# 4 5.295177 0.002015966 9.322912 7.424568
# 5 1.577659 4.013616377 5.092042 5.130887
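Pasting the search string into a regex treats it as a pattern, so a suffix such as "." would match any final character. If the suffix should be taken literally, base R's endsWith() (available since R 3.3.0) avoids regexes altogether. A minimal sketch; drop_cols_ending_with is a hypothetical helper name and the column names are made up:

```r
# Hypothetical helper: drop columns whose names end with a literal suffix
drop_cols_ending_with <- function(df, suffix) {
  keep <- !endsWith(colnames(df), suffix)
  df[, keep, drop = FALSE]
}

df <- data.frame(matrix(1:10, nrow = 2))
colnames(df) <- c("EX 1", "EX 2", "EX.1", "EX+", "EX 11")

res_digit <- drop_cols_ending_with(df, "1")
colnames(res_digit)  # "EX 2" "EX+"

# "+" is a regex metacharacter, but endsWith() compares it literally
res_plus <- drop_cols_ending_with(df, "+")
colnames(res_plus)   # "EX 1" "EX 2" "EX.1" "EX 11"
```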
You can expand it further using regex for a broader pattern search. I have a data frame that has a bunch of columns with "name", "upper_name" and "lower_name", as they represent confidence intervals for a bunch of series, but I don't need them all. So, using regex, you can do the following:
pattern = "(upper_[a-z]*)|(lower_[a-z]*)"
policyData <- policyData[, -grep(pattern = pattern, colnames(policyData))]
The "|" allows me to include an "or" in the regex, so I can do it once with a single pattern rather than search for each pattern separately.
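The pattern above is not anchored, so it also matches names that merely contain "upper_" or "lower_" somewhere in the middle. A slightly tighter sketch anchors the alternation at the start of the name and uses a logical index; the policyData columns here are invented for illustration:

```r
# Made-up stand-in for policyData: point estimates plus interval bounds
policyData <- data.frame(
  gdp       = c(1.0, 1.2),
  upper_gdp = c(1.3, 1.5),
  lower_gdp = c(0.7, 0.9),
  cpi       = c(2.0, 2.1),
  upper_cpi = c(2.4, 2.5),
  lower_cpi = c(1.6, 1.7)
)

# "^" anchors at the start; "(upper|lower)" is the alternation
pattern <- "^(upper|lower)_"

policyData_trimmed <- policyData[, !grepl(pattern, colnames(policyData)), drop = FALSE]
colnames(policyData_trimmed)  # "gdp" "cpi"
```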