
How can I manipulate data within a factor level by a subset of another factor in that level in a data frame, without loops?

I have a data frame made up of absorption spectra from multiple sample runs (samples a, b, c, d), where Ydata is wavelength and Xdata is absorption. I am calculating a baseline-corrected absorption by subtracting the average absorption over a quiet wavelength range, away from the peaks of interest.

Simplified data frame:

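DF <- data.frame(
  group = rep(c("a", "b", "c", "d"), each = 10),
  Ydata = rep(1:10, times = 4),
  Xdata = c(seq(1, 10, 1), seq(5, 50, 5), seq(20, 11, -1), seq(0.3, 3, 0.3)),
  abscorr = NA,
  # R >= 4.0 no longer converts strings to factors by default;
  # the loop below relies on levels(DF$group), so request factors explicitly
  stringsAsFactors = TRUE
)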

I need to correct each sample run by subtracting the mean over a subset of the wavelength range within that run. I've been doing it this way:

for (i in seq_along(levels(DF$group))) {
  # all rows of the current sample run
  sub1 <- subset(DF, group == levels(DF$group)[i], select = c(group, Ydata, Xdata))
  # rows of that run inside the quiet wavelength range
  sub2 <- subset(sub1, Ydata > 4 & Ydata < 8, select = c(group, Ydata, Xdata))
  sub1$abscorr <- sub1$Xdata - mean(sub2$Xdata)
  DF <- rbind(sub1, DF)
}

and then tidy up all the NAs:

DF <- na.omit(DF)

The approach above is obviously clunky because of the loop. Is there a better way to do this for a large dataset? Perhaps dplyr?

Try dplyr:

library(dplyr)

DF %>%
    group_by(group) %>%
    mutate(abscorr = Xdata - mean(Xdata[Ydata < 8 & Ydata > 4]))
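
To make concrete what the grouped mutate computes, here is a minimal base R sketch on the question's sample data. It is only an illustration, not part of the dplyr answer: the names quiet and baseline are made up for this example, and the data frame is rebuilt without the placeholder abscorr column.

```r
# Sample data from the question (without the placeholder abscorr column)
DF <- data.frame(
  group = rep(c("a", "b", "c", "d"), each = 10),
  Ydata = rep(1:10, times = 4),
  Xdata = c(seq(1, 10, 1), seq(5, 50, 5), seq(20, 11, -1), seq(0.3, 3, 0.3))
)

# Rows inside the quiet wavelength range (Ydata 5, 6, 7)
quiet <- DF$Ydata > 4 & DF$Ydata < 8

# One baseline per run: the mean of Xdata over the quiet range
# a: 6, b: 30, c: 15, d: 1.8 (up to floating-point rounding)
baseline <- sapply(split(DF$Xdata[quiet], DF$group[quiet]), mean)

# Look up each row's baseline by group name and subtract it
DF$abscorr <- DF$Xdata - unname(baseline[as.character(DF$group)])
```

Each run gets a single baseline value, and every row in that run is shifted by it, which is what mutate does per group in the pipeline above.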

I believe this will do it.

fun <- function(x){
    x$Xdata - mean(x[which(x$Ydata > 4 & x$Ydata < 8), "Xdata"])
}
DF$abscorr <- do.call(c, lapply(split(DF, DF$group), fun))
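
One caveat, offered as a hedged aside: do.call(c, ...) glues the per-group results together in group order, so the assignment above lines up with the rows only when DF is already sorted by group. A possible variant, assuming rows may arrive in any order, is to put the pieces back with base R's unsplit(), which is the inverse of split(). The shuffled data frame below is made up purely to demonstrate this.

```r
DF <- data.frame(
  group = rep(c("a", "b", "c", "d"), each = 10),
  Ydata = rep(1:10, times = 4),
  Xdata = c(seq(1, 10, 1), seq(5, 50, 5), seq(20, 11, -1), seq(0.3, 3, 0.3))
)

# Shuffle the rows so the groups are no longer contiguous
# (fixed seed only to make the example repeatable)
set.seed(1)
shuffled <- DF[sample(nrow(DF)), ]

# Same per-group correction as in the answer
fun <- function(x) {
  x$Xdata - mean(x$Xdata[x$Ydata > 4 & x$Ydata < 8])
}

# unsplit() returns each value to the row it came from
pieces <- lapply(split(shuffled, shuffled$group), fun)
shuffled$abscorr <- unsplit(pieces, shuffled$group)
```

With sorted input both versions should give the same values; unsplit() just removes the dependence on row order.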

Note that when I tested it, all.equal reported a series of differences: the attributes of the two results differ. This is what I ran.

fun <- function(x){
    x$Xdata - mean(x[which(x$Ydata > 4 & x$Ydata < 8), "Xdata"])
}
DF2 <- DF
DF2$abscorr <- do.call(c, lapply(split(DF2, DF2$group), fun))

all.equal(DF[order(DF$group, DF$Ydata), ], DF2)
# [1] "Attributes: < Names: 1 string mismatch >"                                         
# [2] "Attributes: < Length mismatch: comparison on first 2 components >"                
# [3] "Attributes: < Component 2: names for target but not for current >"                
# [4] "Attributes: < Component 2: Attributes: < Modes: list, NULL > >"                   
# [5] "Attributes: < Component 2: Attributes: < Lengths: 1, 0 > >"                       
# [6] "Attributes: < Component 2: Attributes: < names for target but not for current > >"
# [7] "Attributes: < Component 2: Attributes: < current is not list-like > >"            
# [8] "Attributes: < Component 2: target is omit, current is numeric >"                  
# [9] "Component “abscorr”: Modes: numeric, logical"                                     
#[10] "Component “abscorr”: target is numeric, current is logical"

As you can see, there is no difference in the calculated values of abscorr, only in the attributes. Among those are differences in the na.action attribute (left behind by na.omit) and in the row names. I wouldn't worry if I were you, since the values of abscorr are equal.

EDIT.
Note that if I sort DF and then set the problem attributes to NULL, both all.equal and the much stricter identical return TRUE.

DF1 <- DF[order(DF$group, DF$Ydata), ]  # Modify a copy, keep the original
row.names(DF1) <- NULL
attr(DF1, "na.action") <- NULL

all.equal(DF1, DF2)
#[1] TRUE
identical(DF1, DF2)
#[1] TRUE
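
If only the computed values matter, another option is to compare the columns directly instead of the whole data frames. The sketch below uses two small made-up data frames, A and B, that hold the same values but differ in row names and in the na.action attribute that na.omit leaves behind.

```r
# A went through na.omit(): it keeps row names 1 and 3 and gains an "na.action" attribute
A <- na.omit(data.frame(g = c("a", NA, "b"), v = c(1, 2, 3)))
# B holds the same values but was built directly
B <- data.frame(g = c("a", "b"), v = c(1, 3))

# Comparing the whole data frames reports attribute mismatches
whole <- all.equal(A, B)
isTRUE(whole)

# Comparing just the value column ignores row names and na.action
values_match <- isTRUE(all.equal(A$v, B$v))
values_match
```

This avoids editing attributes when the goal is only to confirm that two methods computed the same numbers.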
