How do you remove columns from a data.frame?

Not so much 'How do you...?' but more 'How do YOU...?'

If someone gives you a file with 200 columns and you want to reduce it to the few you need for analysis, how do you go about it? Does one solution offer benefits over another?

Assume we have a data frame with columns col1, col2 through col200. If you only wanted columns 1-100, then 125-135 and 150-200, you could:

dat$col101 <- NULL
dat$col102 <- NULL # etc

or

dat <- dat[,c("col1","col2",...)]

or

dat <- dat[,c(1:100,125:135,...)] # shortest probably but I don't like this

or

dat <- dat[,!names(dat) %in% c("col101","col102",...)]

Anything else I'm missing? I know this is slightly subjective, but it's one of those nitty-gritty things where you might dive in, start doing it one way, and fall into a habit when there are far more efficient ways out there. Much like this question about which.

EDIT:

Or, is there an easy way to create a workable vector of column names? names(dat) doesn't print them with commas in between, which you need in the code examples above, so if you print out the names that way you have spaces everywhere and have to put in the commas manually... Is there a command that will give you "col1","col2","col3",... as output so you can easily grab what you want?

I use data.table's := operator to delete columns instantly regardless of the size of the table.

DT[, coltodelete := NULL]

or

DT[, c("col1","col20") := NULL]

or

DT[, (125:135) := NULL]

or

DT[, (variableHoldingNamesOrNumbers) := NULL]

Any solution using <- or subset will copy the whole table. data.table's := operator merely modifies the internal vector of pointers to the columns, in place. That operation is therefore (almost) instant.
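As a minimal sketch of what "in place" means here (assuming the data.table package is installed; the table and the names alias, snapshot and drop_cols are made up for illustration): a plain assignment of a data.table does not copy it, so deleting columns with := is visible through every name bound to the same table, while copy() gives an independent table.

```r
library(data.table)

# Hypothetical small table standing in for a wide one
DT <- data.table(col1 = 1:3, col2 = 4:6, col3 = 7:9)

alias    <- DT        # plain assignment: still the same table, no copy made
snapshot <- copy(DT)  # copy() creates an independent table

drop_cols <- c("col2", "col3")
DT[, (drop_cols) := NULL]   # columns are removed by reference

names(DT)        # only col1 is left
names(alias)     # also only col1, because alias refers to the same table
names(snapshot)  # still col1, col2, col3
```

If the original table must survive, take the copy() before deleting.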

To delete single columns, I'll just use dat$x <- NULL.

To delete multiple columns, but fewer than about 3-4, I'll use dat$x <- dat$y <- dat$z <- NULL.

For more than that, I'll use subset, with negative names:

subset(mtcars, , -c(mpg, cyl, disp, hp))
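A related base R idiom, not taken from the answer above, is to assign NULL to several columns at once by indexing with a character vector. A small sketch on a made-up data frame (dat and drop_cols are illustrative names):

```r
# Made-up data frame for illustration
dat <- data.frame(x = 1:3, y = 4:6, z = 7:9, w = 10:12)

# Names of the columns to delete, held in a variable
drop_cols <- c("x", "z")

# List-style assignment of NULL removes all named columns in one step
dat[drop_cols] <- NULL

names(dat)
#> [1] "y" "w"
```

This can be handy when the names to drop are computed rather than typed, which subset's unquoted names do not handle as directly.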

For clarity purposes, I often use the select argument in subset. With newer folks, I've learned that keeping the number of commands they need to pick up to a minimum helps adoption. As their skills increase, so too will their coding ability. And subset is one of the first commands I show people when they need to select data within given criteria.

Something like:

> subset(mtcars, select = c("mpg", "cyl", "vs", "am"))
                     mpg cyl vs am
Mazda RX4           21.0   6  0  1
Mazda RX4 Wag       21.0   6  0  1
Datsun 710          22.8   4  1  1
....

I'm sure this will test slower than most other solutions, but I'm rarely at the point where microseconds make a difference.

Use read.table with colClasses entries of "NULL" to avoid creating the columns in the first place:

## example data and temp file
x <- data.frame(x = 1:10, y = rnorm(10), z = runif(10), a = letters[1:10], stringsAsFactors = FALSE)
tmp <- tempfile()
write.table(x, tmp, row.names = FALSE)


(y <- read.table(tmp, colClasses = c("numeric", rep("NULL", 2), "character"), header = TRUE))

    x a
1   1 a
2   2 b
3   3 c
4   4 d
5   5 e
6   6 f
7   7 g
8   8 h
9   9 i
10 10 j

unlink(tmp)
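For a wide file, typing the colClasses vector by hand gets tedious. One possible sketch, assuming the file has a header row, is to read just the header first and build colClasses from the names you want to keep (hdr, keep, cc and y2 are illustrative names, not part of the answer above):

```r
## example data and temp file, as above
x <- data.frame(x = 1:10, y = rnorm(10), z = runif(10), a = letters[1:10],
                stringsAsFactors = FALSE)
tmp <- tempfile()
write.table(x, tmp, row.names = FALSE)

## read the header plus a single data row, only to learn the column names
hdr <- names(read.table(tmp, header = TRUE, nrows = 1))

## columns we want to keep
keep <- c("x", "a")

## NA lets read.table work out the type; "NULL" skips the column entirely
cc <- ifelse(hdr %in% keep, NA, "NULL")

y2 <- read.table(tmp, header = TRUE, colClasses = cc)
unlink(tmp)

names(y2)
#> [1] "x" "a"
```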

For the kinds of large files I tend to get, I generally wouldn't even do this in R. I would use the cut command in Linux to process the data before it gets to R. This isn't a critique of R, just a preference for using some very basic Linux tools like grep, tr, cut, sort, uniq, and occasionally sed & awk (or Perl) when there's something to be done with regular expressions.

Another reason to use standard GNU commands is that I can pass them back to the source of the data and ask that they prefilter the data so that I don't get extraneous data. Most of my colleagues are competent with Linux; fewer know R.

(Updated) A method that I would like to use before long is to pair mmap with a text file and examine the data in situ, rather than read it into RAM at all. I have done this with C, and it can be blisteringly fast.

Sometimes I like to do this using column ids instead.

df <- data.frame(a=rnorm(100),
                 b=rnorm(100),
                 c=rnorm(100),
                 d=rnorm(100),
                 e=rnorm(100),
                 f=rnorm(100),
                 g=rnorm(100))

as.data.frame(names(df))

  names(df)
1         a
2         b
3         c
4         d
5         e
6         f
7         g 

Removing columns "c" and "g":

df[,-c(3,7)]

This is especially useful if you have data.frames that are large or have long column names that you don't want to type. It also helps when column positions follow a pattern, because then you can use seq() to remove them.
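As a small illustration of the seq() idea, here is a sketch on a made-up seven-column data frame that drops every second column by position (drop_idx and df_odd are illustrative names):

```r
# Made-up data frame with columns a..g
df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7)

# Positions 2, 4, 6: every second column
drop_idx <- seq(2, ncol(df), by = 2)

# Negative positions remove those columns
df_odd <- df[, -drop_idx]

names(df_odd)
#> [1] "a" "c" "e" "g"
```

Positions are fragile if the column order of the input ever changes, so this suits one-off cleaning better than reusable scripts.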

RE: Your edit

You don't necessarily have to put "" around each string, nor "," between them, to create a character vector. I find this little trick handy:

x <- unlist(strsplit(
'A
B
C
D
E',"\n"))
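A similar sketch, swapping in scan() for strsplit(): with what = "" it reads whitespace-separated tokens from a string as a character vector, so names pasted with spaces or newlines between them need no quotes or commas.

```r
# Tokens separated by spaces and a newline; no quotes or commas needed
x <- scan(text = "A B C
D E", what = "", quiet = TRUE)

x
#> [1] "A" "B" "C" "D" "E"
```

This only works as written for names that contain no whitespace.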

From http://www.statmethods.net/management/subset.html

# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3") 
newdata <- mydata[!myvars]

# exclude 3rd and 5th variable 
newdata <- mydata[c(-3,-5)]

# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL

I thought it was really clever to make a list of what "not to include".

You can use the setdiff function:

If there are more columns to keep than to delete: suppose you want to delete 2 columns, say col1 and col2, from a data.frame DT; you can do the following:

DT<-DT[,setdiff(names(DT),c("col1","col2"))]

If there are more columns to delete than to keep: suppose you want to keep only col1 and col2:

DT<-DT[,c("col1","col2")]
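One caveat worth sketching for the base data.frame case: when only a single column survives, DT[, cols] collapses to a plain vector unless drop = FALSE is given. The three-column DT below is made up for illustration.

```r
# Made-up data.frame with three columns
DT <- data.frame(col1 = 1:3, col2 = 4:6, col3 = 7:9)

# Names left after removing col1 and col2
keep <- setdiff(names(DT), c("col1", "col2"))

collapsed <- DT[, keep]                # single column: becomes a vector
kept      <- DT[, keep, drop = FALSE]  # stays a data.frame

is.data.frame(collapsed)
#> [1] FALSE
is.data.frame(kept)
#> [1] TRUE
```

Note also that setdiff() silently ignores names that are not present, so a typo in the names to delete leaves that column in place without an error.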

If you already have a vector of names, which there are several ways to create, you can easily use the subset function to keep or drop columns.

dat2 <- subset(dat, select = names(dat) %in% c(KEEP))

In this case KEEP is a pre-created vector of column names. For example:

#sample data via Brandon Bertelsen
df <- data.frame(a=rnorm(100),
                 b=rnorm(100),
                 c=rnorm(100),
                 d=rnorm(100),
                 e=rnorm(100),
                 f=rnorm(100),
                 g=rnorm(100))

#creating the initial vector of names
df1 <- as.matrix(as.character(names(df)))

#retaining only the name values you want to keep
KEEP <- as.vector(df1[c(1:3,5,6),])

#subsetting the initial dataset with the object KEEP
df3 <- subset(df, select = names(df) %in% c(KEEP))

Which results in:

> head(df)
            a          b           c          d
1  1.05526388  0.6316023 -0.04230455 -0.1486299
2 -0.52584236  0.5596705  2.26831758  0.3871873
3  1.88565261  0.9727644  0.99708383  1.8495017
4 -0.58942525 -0.3874654  0.48173439  1.4137227
5 -0.03898588 -1.5297600  0.85594964  0.7353428
6  1.58860643 -1.6878690  0.79997390  1.1935813
            e           f           g
1 -1.42751190  0.09842343 -0.01543444
2 -0.62431091 -0.33265572 -0.15539472
3  1.15130591  0.37556903 -1.46640276
4 -1.28886526 -0.50547059 -2.20156926
5 -0.03915009 -1.38281923  0.60811360
6 -1.68024349 -1.18317733  0.42014397

> head(df3)
        a          b           c           e
1  1.05526388  0.6316023 -0.04230455 -1.42751190
2 -0.52584236  0.5596705  2.26831758 -0.62431091
3  1.88565261  0.9727644  0.99708383  1.15130591
4 -0.58942525 -0.3874654  0.48173439 -1.28886526
5 -0.03898588 -1.5297600  0.85594964 -0.03915009
6  1.58860643 -1.6878690  0.79997390 -1.68024349
            f
1  0.09842343
2 -0.33265572
3  0.37556903
4 -0.50547059
5 -1.38281923
6 -1.18317733
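For comparison, a shorter sketch of the same selection: once KEEP holds the names, plain indexing with df[KEEP] picks the same columns as the subset() call, provided KEEP lists them in the data frame's own order. The small data frame here is made up, and via_subset and via_index are illustrative names.

```r
# Made-up data frame
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Names to keep, taken from names(df) by position
KEEP <- names(df)[c(1, 3)]

via_subset <- subset(df, select = names(df) %in% KEEP)  # logical selection
via_index  <- df[KEEP]                                  # character indexing

names(via_subset)
#> [1] "a" "c"
names(via_index)
#> [1] "a" "c"
```

The two differ when KEEP is in a different order: the %in% form always returns columns in the original order, while df[KEEP] follows the order of KEEP.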

The select() function from dplyr is powerful for subsetting columns. See ?select_helpers for a list of approaches.

In this case, where you have a common prefix and sequential numbers for column names, you could use num_range:

library(dplyr)

df1 <- data.frame(first = 0, col1 = 1, col2 = 2, col3 = 3, col4 = 4)
df1 %>%
  select(num_range("col", c(1, 4)))
#>   col1 col4
#> 1    1    4

More generally, you can use the minus sign in select() to drop columns, like:

mtcars %>%
   select(-mpg, -wt)
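When the names to drop are held in a character vector rather than typed inline, a possible sketch uses the all_of() helper. This assumes a reasonably recent dplyr (1.0 or later) that re-exports all_of(); drop_cols and slim are illustrative names.

```r
library(dplyr)

# Names to drop, held in a variable
drop_cols <- c("mpg", "wt")

# all_of() tells select() to treat the vector's contents as column names
slim <- mtcars %>%
  select(-all_of(drop_cols))

ncol(slim)
#> [1] 9
```

all_of() raises an error if any of the names is missing from the data, which catches typos early.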

Finally, to your question "is there an easy way to create a workable vector of column names?": yes, if you need to edit a list of names manually, use dput to get a comma-separated, quoted list you can easily manipulate:

dput(names(mtcars))
#> c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", 
#> "gear", "carb")

Just addressing the edit.

@nzcoops, you do not need the column names in a comma-delimited character vector. You are thinking about this the wrong way round. When you do

vec <- c("col1", "col2", "col3")

you are creating a character vector. The , just separates the arguments taken by the c() function when you define that vector. names() and similar functions return a character vector of names.

> dat <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3)
> dat
  col1 col2 col3
1    1    1    1
2    2    2    2
3    3    3    3
> names(dat)
[1] "col1" "col2" "col3"

It is far easier and less error-prone to select from the elements of names(dat) than to process its output into a comma-separated string you can cut and paste from.

Say we want columns col1 and col3: subset names(dat), retaining only the ones we want:

> names(dat)[c(1,3)]
[1] "col1" "col3"
> dat[, names(dat)[c(1,3)]]
  col1 col3
1    1    1
2    2    2
3    3    3

You can kind of do what you want, but R will always print the vector to the screen in double quotes ("):

> paste('"', names(dat), '"', sep = "", collapse = ", ")
[1] "\"col1\", \"col2\", \"col3\""
> paste("'", names(dat), "'", sep = "", collapse = ", ")
[1] "'col1', 'col2', 'col3'"

so the latter may be more useful. However, now you have to cut and paste from that string. It is far better to work with objects that return what you want and use standard subsetting routines to keep what you need.
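Applied to the 200-column case from the question, that approach might look like the following sketch. The data frame is a made-up stand-in with zeros in every cell, and keep and dat_small are illustrative names.

```r
# Made-up stand-in for the 200-column file from the question
dat <- as.data.frame(matrix(0, nrow = 2, ncol = 200))
names(dat) <- paste0("col", 1:200)

# Build the wanted names from the number ranges instead of typing them
keep <- paste0("col", c(1:100, 125:135, 150:200))

dat_small <- dat[, keep]

ncol(dat_small)
#> [1] 162
```

Building the names with paste0() keeps the brevity of the numeric ranges while still selecting by name, so the result does not depend on column order.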

rm in within can be quite useful.

within(mtcars, rm(mpg, cyl, disp, hp))
#                     drat    wt  qsec vs am gear carb
# Mazda RX4           3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag       3.90 2.875 17.02  0  1    4    4
# Datsun 710          3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive      3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout   3.15 3.440 17.02  0  0    3    2
# Valiant             2.76 3.460 20.22  1  0    3    1
# ...

May be combined with other operations.

within(mtcars, {
  mpg2=mpg^2
  cyl2=cyl^2
  rm(mpg, cyl, disp, hp)
  })
#                     drat    wt  qsec vs am gear carb cyl2    mpg2
# Mazda RX4           3.90 2.620 16.46  0  1    4    4   36  441.00
# Mazda RX4 Wag       3.90 2.875 17.02  0  1    4    4   36  441.00
# Datsun 710          3.85 2.320 18.61  1  1    4    1   16  519.84
# Hornet 4 Drive      3.08 3.215 19.44  1  0    3    1   36  457.96
# Hornet Sportabout   3.15 3.440 17.02  0  0    3    2   64  349.69
# Valiant             2.76 3.460 20.22  1  0    3    1   36  327.61
# ...
