data.table: transforming subset of columns with a function, row by row
Given a data.table with mostly numeric values, how can one transform just a subset of columns and put them back into the original data table? Generally, I don't want to add any summary statistic as a separate column, just replace the original columns with the transformed ones.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in applying the "scale" function of base R to each row of that data table, but only to those 10 numeric columns.
And to expand on this: what if I have a data table with more columns and I need to use column names to tell the scale function which data points to apply to?
With a regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
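For readers who want to see what that idiom produces, here is a minimal sketch on made-up data. The data frame and its column names are assumptions for illustration only, not taken from the question.

```r
# Hypothetical toy data: one name column plus three "keyword" columns.
df <- data.frame(
  name      = c("a", "b", "c"),
  keyword_1 = c(1, 2, 3),
  keyword_2 = c(2, 4, 6),
  keyword_3 = c(3, 6, 9)
)

cols <- grep("keyword", colnames(df))

# apply(..., 1, scale) returns one column per input row, hence the t().
df[, cols] <- t(apply(df[, cols], 1, scale))

# Each row of the keyword columns should now have mean 0 and sd 1.
row_means <- rowMeans(df[, cols])
row_sds   <- apply(df[, cols], 1, sd)
```

The name column is untouched; only the columns matched by grep() are overwritten.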
I know this looks cumbersome, but it has always worked for me. However, I can't figure out a simple way to do it with data.table.
I would imagine something like this working for data.table:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
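One likely reason the attempt above fails, as far as I can tell: grep() without value = TRUE returns integer column positions, so the right-hand side scales those positions rather than the data in the columns. The small sketch below uses a made-up table (an assumption for illustration) to show what scale() actually receives.

```r
library(data.table)

# Hypothetical toy table; names and values are made up.
dt <- data.table(
  name      = c("a", "b", "c"),
  keyword_1 = c(10, 20, 30),
  keyword_2 = c(5, 5, 5)
)

# These are column positions, not column contents.
positions <- grep("keyword", colnames(dt))

# So this scales the numbers 2 and 3, whatever the table contains.
# With center = FALSE, scale() divides by the root mean square,
# sqrt(sum(x^2) / (n - 1)).
rhs <- scale(positions, center = FALSE)
```

The result depends only on where the columns sit in the table, which is why the data never gets scaled.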
EDIT:
Another example, updating columns with their per-row-scaled version:
dt = data.table object
dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]
Too bad it needs the "as.data.table" part inside, since the transposed value returned by apply is a matrix. Maybe data.table should automatically coerce matrices into data.tables when updating columns?
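One way to avoid the as.data.table() conversion is to do the row-wise arithmetic on a plain matrix and write the columns back with data.table's set(). This is only a sketch under assumptions: the toy table and its column names are made up, and it standardizes with the default centering rather than center = F.

```r
library(data.table)

# Hypothetical toy data; column names are made up for illustration.
dt <- data.table(
  name      = c("a", "b"),
  keyword_1 = c(1, 2),
  keyword_2 = c(2, 4),
  keyword_3 = c(3, 9)
)
cols <- grep("keyword", colnames(dt), value = TRUE)

# Row-wise standardization on a plain matrix. A length-nrow vector recycles
# down the columns, so element [i, j] is adjusted by the statistics of row i.
m      <- as.matrix(dt[, cols, with = FALSE])
scaled <- (m - rowMeans(m)) / apply(m, 1, sd)

# Write the result back one column at a time, by reference.
for (col in cols) set(dt, j = col, value = scaled[, col])
```

Because set() takes one vector per column, no matrix-to-data.table coercion is needed, at the cost of an explicit loop.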
If what you really need is to scale by row, you can try doing it in 2 steps:
# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]
# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
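To check that the two steps really reproduce row-wise scale(), here is a small self-contained sketch. The table is made up, and the column names m and s for the row statistics are my own choice (the code above relies on the default V1 and V2).

```r
library(data.table)

# Hypothetical toy data for illustration.
DT <- data.table(
  name      = c("a", "b", "c"),
  keyword_1 = c(1, 2, 0),
  keyword_2 = c(2, 4, 5),
  keyword_3 = c(3, 9, 10)
)
cols <- grep("keyword", colnames(DT), value = TRUE)

# Reference result from the apply/scale idiom, computed before DT changes.
expected <- t(apply(as.matrix(DT[, cols, with = FALSE]), 1, scale))

# Step 1: one mean and one sd per row, across the selected columns.
row_stats <- DT[, .(m = mean(unlist(.SD)), s = sd(unlist(.SD))),
                by = seq_len(nrow(DT)), .SDcols = cols]

# Step 2: each column is a length-nrow vector, so the per-row statistics
# line up element by element.
DT[, (cols) := lapply(.SD, function(x) (x - row_stats$m) / row_stats$s),
   .SDcols = cols]
```

Comparing the updated columns of DT with expected (for example via all.equal on the two matrices) should show they agree.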
# First let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 1: use magrittr and the pipe operator:
library(magrittr)  # provides %>%
DT[, (grep("keyword", colnames(DT))) := lapply(.SD, . %>% scale(., center = FALSE)),
   .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 2: explicitly define the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
     lapply(.SD, function(x) scale(x, center = FALSE)),
   .SDcols = grep("keyword", colnames(DT))]
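It may help to see what the lapply(.SD, ...) form does on concrete numbers. Note that it applies scale() to each selected column as a whole, so the scaling is per column rather than per row. The sketch below rests on assumptions: the table is made up, and the as.numeric() wrapper is my addition to drop the matrix shape that scale() returns.

```r
library(data.table)

# Hypothetical toy data for illustration.
DT <- data.table(
  name      = c("a", "b", "c"),
  keyword_1 = c(1, 2, 3),
  keyword_2 = c(10, 20, 60)
)
cols <- grep("keyword", colnames(DT), value = TRUE)

# With center = FALSE each column is divided by its root mean square,
# sqrt(sum(x^2) / (n - 1)), computed over the whole column.
DT[, (cols) := lapply(.SD, function(x) as.numeric(scale(x, center = FALSE))),
   .SDcols = cols]
```

Here keyword_1 ends up as c(1, 2, 3) / sqrt(7), a divisor that comes from the column, not from any single row.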
Modification: if you want to do it by group, just add the by = argument:
DT[, (grep("keyword", colnames(DT))) :=
     lapply(.SD, function(x) scale(x, center = FALSE)),
   .SDcols = grep("keyword", colnames(DT)),
   by = Grouping.Variable]
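A tiny worked case of the grouped variant, under assumptions: the grouping column g and the data are made up, and as.numeric() is again my addition to turn the one-column matrix from scale() into a plain vector.

```r
library(data.table)

# Hypothetical toy data with a made-up grouping column g.
DT <- data.table(
  g         = c("x", "x", "y", "y"),
  keyword_1 = c(1, 3, 10, 30)
)
cols <- grep("keyword", colnames(DT), value = TRUE)

# scale() is now applied to each column separately within each group,
# so every group is divided by its own root mean square.
DT[, (cols) := lapply(.SD, function(x) as.numeric(scale(x, center = FALSE))),
   .SDcols = cols, by = g]
```

Both groups have the same internal ratio (1 to 3), so after scaling they end up with identical values even though their original magnitudes differ by a factor of ten.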
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
The above solution clearly works for the narrow example given.
As a public service, I am posting this for anyone that is still searching for a way that
library(data.table)

# You get a data.table called DT
DT <- as.data.table(df)
# Get the names of the columns to transform
# (value = TRUE returns names rather than positions)
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)
# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")
# Define the function you wish to apply.
# normalize standardizes a vector: subtract the mean, divide by the standard deviation.
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE))
{
  X <- (X - X.mean) / X.sd
  return(X)
}
# Voila, the newly created set of columns that contain the transformed values
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
# Transformed columns
DT[, .SD, .SDcols = Reference.Cols.normalized]
# Original, untransformed columns
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who come back to look at code after some time, this more step-by-step, general approach will be helpful.