
data.table: transforming subset of columns with a function, row by row

How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back into the original data table? Generally, I don't want to add any summary statistic as a separate column, just replace the original columns with the transformed ones.

Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in applying base R's "scale" function to each row of that data table, but only to those 10 numeric columns.

To expand on this: what if I have a data table with more columns and I need to use column names to tell the scale function which data points to apply it to?

With a regular data.frame I would just do:

df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
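For reference, here is a small runnable sketch of that data.frame approach. The data and column names (`name`, `keyword_x`, and so on) are made up for illustration and are not from the original post.

```r
# Toy data (hypothetical): one name column plus three "keyword" columns
df <- data.frame(
  name      = c("a", "b", "c"),
  keyword_x = c(1, 2, 3),
  keyword_y = c(2, 4, 6),
  keyword_z = c(3, 6, 9)
)
cols <- grep("keyword", colnames(df))

# apply() over rows returns one column per input row,
# so transpose to get back to rows x columns
df[, cols] <- t(apply(df[, cols], 1, scale))

# Across the keyword columns, each row now has mean 0 and sd 1
rowMeans(df[, cols])
apply(df[, cols], 1, sd)
```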

I know this looks cumbersome, but it has always worked for me. However, I can't figure out a simple way to do it in data.table.

I would imagine something like this would work for a data.table:

dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]

But it doesn't.
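One likely reason, sketched on made-up data: the right-hand side never touches the column values. `grep()` without `value = TRUE` returns integer column positions, so `scale()` is applied to those positions rather than to the data. The table and column names below are hypothetical.

```r
library(data.table)

# Toy data (hypothetical)
dt <- data.table(
  name      = c("a", "b", "c"),
  keyword_x = c(1, 2, 3),
  keyword_y = c(2, 4, 6),
  keyword_z = c(3, 6, 9)
)

# grep() returns the positions of the matching columns, not their contents
idx <- grep("keyword", colnames(dt))
idx

# So this scales the position vector itself and yields a one-column matrix
rhs <- scale(idx, center = FALSE)
dim(rhs)
```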

EDIT:

Another example of updating columns with their per-row-scaled version:

Here dt is a data.table object:

dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]

Too bad it needs the "as.data.table" part inside, as the transposed value from the apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables when updating columns?
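The same update can be split into two statements, which makes the matrix-to-data.table conversion easier to see. This is only a sketch on made-up data; `cols` and `scaled` are helper names introduced here.

```r
library(data.table)

# Toy data (hypothetical)
dt <- data.table(
  name      = c("a", "b", "c"),
  keyword_x = c(1, 2, 3),
  keyword_y = c(2, 4, 6),
  keyword_z = c(3, 6, 9)
)
cols <- grep("keyword", colnames(dt), value = TRUE)

# Row-wise scaling gives a plain matrix (rows x columns after transposing)
scaled <- t(apply(dt[, cols, with = FALSE], 1, scale))

# := expects a list of columns on the right-hand side, hence as.data.table()
dt[, (cols) := as.data.table(scaled)]
dt
```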

If what you really need is to scale by row, you can try doing it in 2 steps:

# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]

# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
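As a sanity check, the two steps can be run on a small made-up table and compared with the `t(apply(..., 1, scale))` result from the question. The data and the helper names `cols` and `expected` are assumptions for illustration.

```r
library(data.table)

# Toy data (hypothetical)
DT <- data.table(
  name      = c("a", "b", "c"),
  keyword_x = c(1, 2, 3),
  keyword_y = c(2, 4, 6),
  keyword_z = c(3, 6, 9)
)
cols <- grep("keyword", colnames(DT), value = TRUE)

# Reference result: row-wise scale() via apply(), as in the question
expected <- t(apply(as.matrix(DT[, cols, with = FALSE]), 1, scale))

# Step 1: per-row mean and sd across the selected columns
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))),
              by = 1:nrow(DT), .SDcols = cols]

# Step 2: subtract the row mean and divide by the row sd, column by column
DT[, (cols) := lapply(.SD, function(x) (x - mean_sd$V1) / mean_sd$V2),
   .SDcols = cols]

# Should report TRUE if both approaches agree
all.equal(unname(as.matrix(DT[, cols, with = FALSE])), unname(expected))
```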

PART 1: The one line solution you requested:

# First, let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]

One-line Solution Version 1: Use magrittr and the pipe operator:

library(magrittr)  # provides %>%
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = FALSE))),
    .SDcols = grep("keyword", colnames(DT))]

One-line Solution Version 2: Explicitly define the function for the lapply:

DT[, (grep("keyword", colnames(DT))) := 
     (lapply(.SD, function(x){scale(x, center = FALSE)})), 
     .SDcols = grep("keyword", colnames(DT))]
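Note that `lapply(.SD, ...)` hands `scale()` one column at a time, so the one-liners above rescale each column rather than each row. A small sketch on made-up data shows the effect; `as.vector()` is added here as a precaution, since `scale()` returns a one-column matrix, and is not part of the original answer.

```r
library(data.table)

# Toy data (hypothetical)
DT <- data.table(
  name      = c("a", "b", "c"),
  keyword_x = c(1, 2, 3),
  keyword_y = c(2, 4, 6),
  keyword_z = c(3, 6, 9)
)
cols <- grep("keyword", colnames(DT), value = TRUE)

# Each column is divided by its own root mean square (center = FALSE)
DT[, (cols) := lapply(.SD, function(x) as.vector(scale(x, center = FALSE))),
   .SDcols = cols]

# Per-column root mean square (with n - 1 in the denominator) is now 1
DT[, lapply(.SD, function(x) sqrt(sum(x^2) / (length(x) - 1))), .SDcols = cols]
```

Because the three toy columns are multiples of each other, they end up identical after column-wise scaling, which would not happen if the rows were being scaled.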

Modification: if you want to do it by group, just use by =:

DT[  , (grep("keyword", colnames(DT))) := 
              (lapply(.SD, function(x){scale(x, center = FALSE)}))
     , .SDcols = grep("keyword", colnames(DT))
     , by = Grouping.Variable]
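A minimal grouped sketch, assuming a hypothetical grouping column called `group` in place of `Grouping.Variable`. Within each group, every selected column is divided by that group's root mean square; `as.vector()` is again an added precaution.

```r
library(data.table)

# Toy data (hypothetical): two groups of two rows each
DT <- data.table(
  group     = c("g1", "g1", "g2", "g2"),
  keyword_x = c(1, 3, 10, 30),
  keyword_y = c(2, 2, 5, 15)
)
cols <- grep("keyword", colnames(DT), value = TRUE)

# scale() now sees one column of one group at a time
DT[, (cols) := lapply(.SD, function(x) as.vector(scale(x, center = FALSE))),
   .SDcols = cols, by = group]
DT
```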

You can verify:

# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]

PART 2: A Step-by-Step Solution (more general and easier to follow)

The above solution clearly works for the narrow example given.

As a public service, I am posting this for anyone who is still searching for a way that:

  • feels a bit less condensed;
  • is easier to understand;
  • is more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b., does work perfectly here).

Here's the step-by-step way of doing the same:

Get the data into data.table format:

# You get a data.table called DT
DT <- as.data.table(df)

Then, handle the column names:

# Get the vector of column names (value = TRUE returns names, not positions)
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)

# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")

Define the function you want to apply:

# Define the function you wish to apply.
# Here, normalize centers a vector on its mean and divides by its standard
# deviation, which is what scale() does with its default settings:

normalize <- function(X, 
                      X.mean = mean(X, na.rm = TRUE), 
                      X.sd = sd(X, na.rm = TRUE))
                      {
                          X <- (X - X.mean) / X.sd
                          return(X)
                      }

After that, it is trivial in data.table syntax:

# Voila, the newly created set of columns that contain the transformed values
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]

Verify:

The new values are stored in the columns named in Reference.Cols.normalized:

DT[, .SD, .SDcols = Reference.Cols.normalized]

The untransformed values are left unharmed:

DT[, .SD, .SDcols = Reference.Cols]
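Putting the steps together on made-up data gives the following self-contained sketch. The data frame and its column names are hypothetical, and as with any `lapply(.SD, ...)` call, `normalize` is applied to each column as a whole.

```r
library(data.table)

# Toy data (hypothetical)
df <- data.frame(
  name      = c("a", "b", "c"),
  keyword_x = c(1, 2, 3),
  keyword_y = c(2, 4, 6)
)
DT <- as.data.table(df)

# Source columns and the names of the new columns that will hold the results
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")

# Center on the mean and divide by the standard deviation
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
  (X - X.mean) / X.sd
}

# Add the normalized columns by reference; the originals stay untouched
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
colnames(DT)
```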

Hopefully, for those of you who return to look at code after some interval, this more step-by-step, general approach will be helpful.
