
data.table: transforming subset of columns with a function, row by row

How can one transform just a subset of columns in a data.table that holds mostly numeric values, and write the results back into the original table? I don't want to add any summary statistic as a separate column, only to replace the selected columns with their transformed values.

Assume we have a data.table DT with 1 column of names and 10 columns of numeric values. I want to apply base R's scale function to each row of that table, but only across those 10 numeric columns.

To expand on this: what if the table has more columns and I need to use column names to tell scale which values to operate on?

With a regular data.frame I would just do:

df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))

I know this looks cumbersome, but it has always worked for me. However, I can't figure out a simple way to do it with a data.table.

I would imagine something like this working for a data.table:

dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]

But it doesn't.

EDIT:

Another example, updating the columns with their per-row-scaled versions (dt is a data.table object):

dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]

Too bad it needs the "as.data.table" part inside, as the transposed value from apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?

If what you need is really to scale by row, you can try doing it in 2 steps:

# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]

# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
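As a minimal sketch of the same two-step idea, the snippet below builds a small made-up table and checks the result against the data.frame idiom from the question. The column names keyword_1 to keyword_3 and the helpers row_mean and row_sd are hypothetical, and the row statistics are computed on a matrix copy rather than with by = 1:nrow(DT), which is only one of several ways to get them.

```r
library(data.table)

# Toy data (hypothetical): one name column, three numeric "keyword" columns
DT <- data.table(name = c("a", "b", "c"),
                 keyword_1 = c(1, 2, 3),
                 keyword_2 = c(4, 6, 8),
                 keyword_3 = c(7, 10, 13))

cols <- grep("keyword", colnames(DT), value = TRUE)

# Step 1: per-row mean and sd, computed on a matrix copy of the selected columns
m <- as.matrix(DT[, cols, with = FALSE])
row_mean <- rowMeans(m)
row_sd <- apply(m, 1, sd)

# Reference result from the data.frame idiom in the question
expected <- t(apply(m, 1, scale))

# Step 2: each column is a vector over rows, so it lines up with row_mean / row_sd
DT[, (cols) := lapply(.SD, function(x) (x - row_mean) / row_sd), .SDcols = cols]

# Compare with the reference; for this toy data every row becomes -1, 0, 1
all.equal(unname(as.matrix(DT[, cols, with = FALSE])), unname(expected),
          check.attributes = FALSE)
```

The columns in this sketch are doubles on purpose: assigning fractional values into integer columns with := would not silently change the column type.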

PART 1: The one-line solution you requested:

# First, let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]

One-line Solution Version 1: Use magrittr and the pipe operator:

library(magrittr)  # provides %>%
DT[, (grep("keyword", colnames(DT))) := lapply(.SD, . %>% scale(., center = FALSE)),
    .SDcols = grep("keyword", colnames(DT))]

One-line Solution Version 2: Explicitly define the function for lapply:

DT[, (grep("keyword", colnames(DT))) :=
     lapply(.SD, function(x) scale(x, center = FALSE)),
     .SDcols = grep("keyword", colnames(DT))]

Modification: if you want to do it by group, just add a by = argument:

DT[, (grep("keyword", colnames(DT))) :=
     lapply(.SD, function(x) scale(x, center = FALSE)),
     .SDcols = grep("keyword", colnames(DT)),
     by = Grouping.Variable]

You can verify:

# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
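One thing worth keeping in mind: lapply(.SD, ...) visits one column at a time, so calls of this shape rescale each column over its rows (or within each group), not each row across columns. The sketch below illustrates that on made-up data; the column names are hypothetical, and wrapping the result in as.vector is a defensive assumption on my part, used to drop the one-column matrix and attributes that scale() returns before assigning.

```r
library(data.table)

# Toy data (hypothetical): keyword_2 is exactly 10 times keyword_1
DT <- data.table(name = c("a", "b", "c", "d"),
                 keyword_1 = c(1, 2, 3, 4),
                 keyword_2 = c(10, 20, 30, 40))

cols <- grep("keyword", colnames(DT), value = TRUE)

# With center = FALSE, scale() divides each column by its root mean square,
# sqrt(sum(x^2) / (n - 1)), computed over the rows of that column
DT[, (cols) := lapply(.SD, function(x) as.vector(scale(x, center = FALSE))),
   .SDcols = cols]

# Because scaling is per column, the two columns end up identical
all.equal(DT$keyword_1, DT$keyword_2)
```

If the two columns had been scaled row by row instead, they would not coincide like this, which makes the toy data a quick check of which direction a given call works in.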

PART 2: A Step-by-Step Solution: (more general and easier to follow)

The solution above clearly works for the narrow example given.

As a public service, I am posting this for anyone still searching for an approach that

  • feels a bit less condensed;
  • is easier to understand;
  • is more general, in the sense that you can apply any function you wish without first computing the values into a separate data table (which, N.B., does work perfectly here).

Here's the step-by-step way of doing the same:

Get the data into data.table format:

library(data.table)

# You get a data.table called DT
DT <- as.data.table(df)

Then, handle the column names:

# Get the vector of matching column names (value = TRUE returns names, not positions)
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)

# For people who want to store both transformed and untransformed values:
# create new column names
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")

Define the function you want to apply:

# normalize centers a vector on its mean and divides by its standard deviation,
# like the default behaviour of scale(), ignoring NAs
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
    (X - X.mean) / X.sd
}

After that, it is trivial in data.table syntax:

# Voila: a newly created set of columns that contain the transformed values
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]

Verify:

The new values are stored in the columns named in Reference.Cols.normalized:

DT[, .SD, .SDcols = Reference.Cols.normalized]

The untransformed values are left unharmed:

DT[, .SD, .SDcols = Reference.Cols]
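For anyone who wants to try the steps end to end, here is a self-contained sketch on made-up data. The data frame df and its column names keyword_x and keyword_y are hypothetical stand-ins. As in the steps above, normalize is applied per column, so the new columns hold column-wise standardized values.

```r
library(data.table)

# Toy input (hypothetical): one name column, two numeric "keyword" columns
df <- data.frame(name = c("a", "b", "c"),
                 keyword_x = c(1, 2, 3),
                 keyword_y = c(10, 20, 60))
DT <- as.data.table(df)

# Names of the columns to transform, and names for the new columns
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")

# Center on the mean and divide by the standard deviation, ignoring NAs
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
    (X - X.mean) / X.sd
}

# Add the transformed columns by reference; the originals are untouched
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]

colnames(DT)
DT[, .SD, .SDcols = Reference.Cols.normalized]
```

To overwrite the original columns instead of adding new ones, the same call should work with (Reference.Cols) on the left-hand side of :=.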

Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.
