R data.table: how to use R variables that contain column names?

I've read the data.table documentation several times but I still can't wrap my head around how to do some operations; more generally I still haven't understood the underlying "philosophy" on how to work with variable names. Consider this example problem:

I have a data table with variables 'a', 'b', 'c', 'd':

> dt <- data.table(a=c(1,1,2), b=1:3, c=11:13, d=21:23)
> dt
   a b  c  d
1: 1 1 11 21
2: 1 2 12 22
3: 2 3 13 23

Suppose my script interactively asks the user to input a column name and corresponding value that should be used to select rows. These two variables are stored in rowselectname and rowselectvalue :

> rowselectname
[1] "a"
> rowselectvalue
[1] 1

The script also interactively asks the user to select some columns of interest; their names are stored in colselectnames :

> colselectnames
[1] "b" "d"

Now I want to create a new data table from dt , with the rows for which rowselectname has the value rowselectvalue , and with the columns given by colselectnames . The only way I finally managed to do this is as follows:

> newdt <- dt[get(rowselectname)==rowselectvalue, ..colselectnames]
> newdt
   b  d
1: 1 21
2: 2 22

What I don't understand is why I have to use get() for the first selection and .. for the second. Why not get() for both (it doesn't work)? Or why not .. for both (doesn't work either)? This seems inconsistent to me, but maybe there's another way of doing this with a more consistent syntax. I think the most obvious should simply be newdt <- dt[rowselectname==rowselectvalue, colselectnames] , which is how the rest of R seems to work.

I'd really appreciate someone explaining to me how to look at this to make sense of the syntax.

We can pass colselectnames to .SDcols and select .SD in j. Because the row-selection column is supplied as a string, get() is used in i to return the values of that column. The same can be done by converting the string to a symbol and evaluating it ( eval(as.name(rowselectname)) ).

dt[get(rowselectname)==rowselectvalue, .SD, .SDcols =  colselectnames]
   b  d
1: 1 21
2: 2 22
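
The symbol-based alternative mentioned above can be sketched as follows. This is a minimal example that rebuilds the question's dt and variables so it runs on its own; it is only meant to show that eval(as.name(...)) can take the place of get() in i.

```r
library(data.table)

# Data and user inputs as defined in the question
dt <- data.table(a = c(1, 1, 2), b = 1:3, c = 11:13, d = 21:23)
rowselectname  <- "a"
rowselectvalue <- 1
colselectnames <- c("b", "d")

# as.name() turns the string "a" into the symbol `a`;
# eval() then resolves that symbol to the column inside dt
newdt <- dt[eval(as.name(rowselectname)) == rowselectvalue,
            .SD, .SDcols = colselectnames]
print(newdt)
```

Both get() and eval(as.name()) resolve the column at evaluation time, so the choice between them is mostly a matter of taste.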

If we want to use the .. prefix, it has to go in j, so the row-selection column is first extracted with a nested call and then compared in i:

dt[dt[, ..rowselectname][[1]] == rowselectvalue, ..colselectnames]
   b  d
1: 1 21
2: 2 22
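
For comparison, here is a sketch that avoids both get() and the .. prefix. It relies on two things data.table supports: extracting a column by name with [[ ]], and the with = FALSE argument, which makes j be treated as a character vector of column names. The intermediate variable rows is only an illustrative name.

```r
library(data.table)

dt <- data.table(a = c(1, 1, 2), b = 1:3, c = 11:13, d = 21:23)
rowselectname  <- "a"
rowselectvalue <- 1
colselectnames <- c("b", "d")

# [[ ]] extracts the column named by the string, as with a plain list
rows <- dt[[rowselectname]] == rowselectvalue   # logical vector, one entry per row

# with = FALSE tells data.table to read j as column names, not as an expression
newdt <- dt[rows, colselectnames, with = FALSE]
print(newdt)
```

This is the closest to the base R data.frame idiom, at the cost of computing the row filter outside the [ ] call.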

Starting with data.table development version 1.14.3 (released on CRAN as 1.15.0), get() is no longer needed for this, because you can use the new env argument. From the release notes:

A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table , and the new ?substitute2 manual page.

# install dev version (the argument is spelled `repos`)
install.packages("https://github.com/Rdatatable/data.table/archive/master.tar.gz", repos = NULL, type = "source")

library(data.table)

dt[rowselectname==rowselectvalue, ..colselectnames, env=list(rowselectname=rowselectname)]

   b  d
1: 1 21
2: 2 22
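
A slightly fuller sketch of the env interface is given below, assuming a data.table version that has the env argument (1.15.0 or later, hence the version guard). The placeholder names col and val are made up for illustration. Character values in env are substituted as symbols, so col becomes the column a, while the numeric val is substituted as the value itself.

```r
library(data.table)

dt <- data.table(a = c(1, 1, 2), b = 1:3, c = 11:13, d = 21:23)
rowselectname  <- "a"
rowselectvalue <- 1
colselectnames <- c("b", "d")

# Guard: the env argument is assumed to be available from 1.15.0 onwards
if (packageVersion("data.table") >= "1.15.0") {
  # `col` and `val` are hypothetical placeholders, replaced before evaluation,
  # so the filter that actually runs is `a == 1`
  newdt <- dt[col == val, .SD, .SDcols = colselectnames,
              env = list(col = rowselectname, val = rowselectvalue)]
  print(newdt)
}
```

To substitute a string as a literal character value rather than as a column name, the vignette describes wrapping it in I().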

data.table allows for fast subsetting using the on argument and secondary indices (see the vignette Secondary indices and auto indexing , in particular chapter 2).

Using the on argument we can write

library(data.table)
dt[.(rowselectvalue), on = rowselectname, ..colselectnames]
       b     d
   <int> <int>
1:     1    21
2:     2    22

This concise code is similar to a data.table join where the second data.table is created on-the-fly.
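
To make the join analogy concrete, the sketch below builds the one-row lookup table explicitly instead of using .() . The name lookup is only illustrative; the result should match the on-the-fly version.

```r
library(data.table)

dt <- data.table(a = c(1, 1, 2), b = 1:3, c = 11:13, d = 21:23)
rowselectname  <- "a"
rowselectvalue <- 1
colselectnames <- c("b", "d")

# One-column table holding the value to match, named after the join column
lookup <- data.table(rowselectvalue)
setnames(lookup, rowselectname)

# Join dt to lookup on that column, keeping only the requested columns
newdt <- dt[lookup, on = rowselectname, ..colselectnames]
print(newdt)
```

Adding more rows to lookup would select rows matching any of several values.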


In case rowselectvalue is not found in column rowselectname , the result can be controlled by the nomatch argument.

By default, a row of NA values is returned, e.g.,

dt[.(rowselectvalue + 10), on = rowselectname, ..colselectnames]
       b     d
   <int> <int>
1:    NA    NA

Setting nomatch = NULL drops the unmatched rows, so an empty data.table is returned, e.g.,

dt[.(rowselectvalue + 10), on = rowselectname, ..colselectnames, nomatch = NULL]
 Empty data.table (0 rows and 2 cols): b,d

