rowMean if row passes a test

I'm working on a data set in which each variable name is prefixed with a 2-letter abbreviation of its source. Variables from source AA look like AA_var1, and variables from source bb look like bb_variable_name_2. There are actually many sources and many variable names, but I keep only 2 sources as a minimal example.

I want to create a mean variable for any row where the number of sources, that is, the number of unique prefixes for which the data on that row is not NA, is greater than 1. If there's only one source, I want that variable to be NA.

So, for example, my data looks like this:

> head(df)
  AA_var1 AA_var2   myid   bb_meow bb_A_v1
1      NA      NA 123456        10      12
2      NA      10 194200        12      NA
3      12      10 132200        NA      NA
4      12      NA 132201        NA      12
5      NA      NA 132202        NA      NA
6      12      13 132203        14      NA

And I want the following:

> head(df)
  AA_var1 AA_var2   myid   bb_meow bb_A_v1  rowMeanIfDiverseData
1      NA      NA 123456        10      12                    NA #has only bb
2      NA      10 194200        12      NA                    11 #has AA and bb
3      12      10 132200        NA      NA                    NA #has only AA
4      12      NA 132201        NA      12                    12 #has AA and bb
5      NA      NA 132202        NA      NA                    NA #has neither
6      12      13 132203        14      NA                    13 #has AA and bb

Normally, I just use rowMeans() for this kind of thing. But the additional subsetting, selecting only rows whose variable names follow a convention /at the row level/, has left me confused between the item-level and the general apply-level statements I'm used to.

I can get the prefixes at the data frame level:

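library(stringr)

mynames <- names(df[!names(df) %in% c("myid")])
# perl() is deprecated in newer stringr; the default regex engine supports the lookahead
tmp <- str_extract(mynames, "[A-Za-z]{2}(?=_)")
uniq <- unique(tmp[!is.na(tmp)])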

So,

> uniq
[1] "AA" "bb"

So, I can make this a function I can apply to df like so:

badnames <- c("myid")

multiSource <- function(x){
    nm <- names(x[!names(x) %in% badnames])     # exclude c("myid")
    tmp <- str_extract(nm, "[A-Za-z]{2}(?=_)")  # get prefixes
    uniq <- unique(tmp[!is.na(tmp)])            # ensure unique and not NA
    length(uniq) > 1                            # TRUE when more than one prefix is found
}

But this is clearly confused, and still works at the data-set level, i.e.:

> lapply(df,multiSource)
$AA_var1
[1] FALSE

$AA_var2
[1] FALSE

$bb_meow
[1] FALSE

$bb_A_v1
[1] FALSE

And...

> apply(df,MARGIN=1,FUN=multiSource)

Gives TRUE for all.

I'd otherwise like to be saying...

df$rowMean <- rowMeans(df, na.rm=T)

# so, in this case
rowMeansIfTest <- function(X,test) {
   # is this row multiSource TRUE?
   # if yes, return(rowMeans(X))
   # else return(NA)
}

df$rowMeanIfDiverseData <- rowMeansIfTest(df, test=multiSource)

But it is unclear to me how to do this without some kind of for loop.

The strategy here is to split the data frame by columns into variable groups (one per prefix) and, for each row, identify whether each group has any non-NA values. We then check with rowSums that at least two groups have non-NA values for a row, and if so, add the mean of that row's values with cbind.

This will generalize to any number of columns so long as they are named in the AA_varXXX format, and so long as the only column not in that format is myid. It is easy enough to fix if this isn't strictly the case, but these are the limitations of the code as written.

df.dat <- df[!names(df) == "myid"]
diverse.rows <- rowSums(
  sapply(
    split.default(df.dat, gsub("^([A-Z]{2})_var.*", "\\1", names(df.dat))), 
    function(x) apply(x, 1, function(y) any(!is.na(y)))
) ) > 1
cbind(df, div.mean=ifelse(diverse.rows, rowMeans(df.dat, na.rm=T), NA))

Produces:

  AA_var1 AA_var2   myid BB_var3 BB_var4 div.mean
1      NA      NA 123456      10      12       NA
2      NA      10 194200      12      NA       11
3      12      10 132200      NA      NA       NA
4      12      NA 132201      NA      12       12
5      NA      NA 132202      NA      NA       NA
6      12      13 132203      14      NA       13
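
If the columns keep the mixed names shown in the question (bb_meow, bb_A_v1), the grouping step above would need a looser rule. The following is only a sketch, assuming every data column starts with exactly two letters and an underscore and that anything else (such as myid) should be ignored; the helper names is_data, prefix and has_source are mine, not part of the original answer.

```r
# Toy data using the mixed column names from the question
df <- data.frame(
  AA_var1 = c(NA, NA, 12, 12, NA, 12),
  AA_var2 = c(NA, 10, 10, NA, NA, 13),
  myid    = c(123456, 194200, 132200, 132201, 132202, 132203),
  bb_meow = c(10, 12, NA, NA, NA, 14),
  bb_A_v1 = c(12, NA, NA, 12, NA, NA)
)

# Assumption: every data column is "<two letters>_<anything>"; all others are ignored
is_data <- grepl("^[A-Za-z]{2}_", names(df))
df.dat  <- df[is_data]
prefix  <- substr(names(df.dat), 1, 2)   # "AA" "AA" "bb" "bb"

# One logical column per source: does the row have any non-NA value from it?
has_source <- sapply(
  split.default(df.dat, prefix),
  function(x) rowSums(!is.na(x)) > 0
)

# Keep the row mean only where more than one source contributes
diverse.rows <- rowSums(has_source) > 1
df$div.mean  <- ifelse(diverse.rows, rowMeans(df.dat, na.rm = TRUE), NA)
```

Taking the prefix with substr() rather than a pattern tied to _var means new sources need no code changes, as long as the two-letter convention holds.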

This solution seems a little convoluted to me, so there's probably a better way, but it should work for you.

# Here's your data:
df <- data.frame(AA_var1 = c(NA,NA,12,12,NA,12),
                 AA_var2 = c(NA,10,10,NA,NA,13),
                 BB_var3 = c(10,12,NA,NA,NA,14),
                 BB_var4 = c(12,NA,NA,12,NA,NA))

# calculate rowMeans for each subset of variables
a <- rowMeans(df[,grepl('AA',names(df))], na.rm=TRUE)
b <- rowMeans(df[,grepl('BB',names(df))], na.rm=TRUE)

# count non-missing values for each subset of variables
a2 <- rowSums(!is.na(df[,grepl('AA',names(df))]), na.rm=TRUE)
b2 <- rowSums(!is.na(df[,grepl('BB',names(df))]), na.rm=TRUE)

# calculate means:
rowSums(cbind(a*a2,b*b2)) /
    rowSums(!is.na(df[,grepl('^(AA|BB)_',names(df))]), na.rm=TRUE)

Result:

> df$rowMeanIfDiverseData <- rowSums(cbind(a*a2,b*b2)) /
+         rowSums(!is.na(df[,grepl('^(AA|BB)_',names(df))]), na.rm=TRUE)
> df
  AA_var1 AA_var2 BB_var3 BB_var4 rowMeanIfDiverseData
1      NA      NA      10      12                  NaN
2      NA      10      12      NA                   11
3      12      10      NA      NA                  NaN
4      12      NA      NA      12                   12
5      NA      NA      NA      NA                  NaN
6      12      13      14      NA                   13

And a little cleanup to exactly match your intended output:

> df$rowMeanIfDiverseData[is.nan(df$rowMeanIfDiverseData)] <- NA
> df
  AA_var1 AA_var2 BB_var3 BB_var4 rowMeanIfDiverseData
1      NA      NA      10      12                   NA
2      NA      10      12      NA                   11
3      12      10      NA      NA                   NA
4      12      NA      NA      12                   12
5      NA      NA      NA      NA                   NA
6      12      13      14      NA                   13
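
The single-source rule in this approach is never tested explicitly; it falls out of how NaN propagates. As a rough illustration of that mechanism (not part of the original answer), here is one row with no AA values and two BB values whose mean is 11:

```r
# A row with no AA values: the mean of zero values is NaN (0/0)
a  <- rowMeans(data.frame(AA_var1 = NA_real_, AA_var2 = NA_real_), na.rm = TRUE)
a2 <- 0                      # number of non-missing AA values in that row

is.nan(a)                    # TRUE
is.nan(a * a2)               # TRUE: NaN times zero is still NaN

# rowSums() without na.rm = TRUE propagates the NaN, so the whole row is NaN
total <- rowSums(cbind(a * a2, 11 * 2))
is.nan(total)                # TRUE

# is.na() is also TRUE for NaN, so either test can drive the cleanup
is.na(total)                 # TRUE
```

This is why the outer rowSums() call must not use na.rm = TRUE: dropping the NaN would turn single-source rows back into ordinary means.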

My attempt, somewhat long-winded:

dat<-data.frame(AA_var1=c(NA,NA,12,12,NA,12),
                    AA_var2=c(NA,10,10,NA,NA,13),
                    myid=1:6,
                    BB_var3=c(10,12,NA,NA,NA,14),
                    BB_var4=c(12,NA,NA,12,NA,NA))

# which columns are associated with the variables used in our mean
varcols<-grep("var[1-9]",names(dat),value=T)

# which rows have the requisite diversity of non-NA sources
# I assume these column names start with capitals followed by an underscore
meanrow<-apply(!is.na(dat[,varcols]),1,function(x){n<-varcols[x]
                              1<length(unique(regmatches(n,regexpr("[A-Z]+_",n))))
                                            })
# compute the row mean for all rows
dat$meanval<-rowMeans(dat[,varcols],na.rm=T)

# set to NA the rows without diversity (i.e. !meanrow)
dat[!meanrow,"meanval"]<-NA

I think some of the answers are making this seem more complicated than it is. This will do it:

df$means = ifelse(rowSums(!is.na(df[, grep('AA_var', names(df))])) &
                    rowSums(!is.na(df[, grep('BB_var', names(df))])),
                  rowMeans(df[, grep('_var', names(df))], na.rm = T), NA)
#  AA_var1 AA_var2   myid BB_var3 BB_var4 means
#1      NA      NA 123456      10      12    NA
#2      NA      10 194200      12      NA    11
#3      12      10 132200      NA      NA    NA
#4      12      NA 132201      NA      12    12
#5      NA      NA 132202      NA      NA    NA
#6      12      13 132203      14      NA    13

Here's a generalization of the above, given the comment, assuming unique ids (if they're not unique, create a unique index instead):

library(data.table)
library(reshape2)

dt = data.table(df)
setkey(dt, myid) # sorts dt by myid; the order-based subsetting below relies on this

# find the conditional
cond = melt(dt, id.var = 'myid')[,
         sum(!is.na(value)), by = list(myid, sub('_var.*', '', variable))][,
         all(V1 != 0), keyby = myid]$V1

# fill in the means (could also do a join, but will rely on ordering instead)
dt[cond, means := rowMeans(.SD, na.rm = T), .SDcols = grep('_var', names(dt))]

dt
#   AA_var1 AA_var2   myid BB_var3 BB_var4 means
#1:      NA      NA 123456      10      12    NA
#2:      12      10 132200      NA      NA    NA
#3:      12      NA 132201      NA      12    12
#4:      NA      NA 132202      NA      NA    NA
#5:      12      13 132203      14      NA    13
#6:      NA      10 194200      12      NA    11
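
One thing to keep in mind with more than two sources: all(V1 != 0) keeps a row only when every source is present, while the question asks for more than one source. With two sources the rules coincide. The base R sketch below shows where they diverge; the data frame df3 and its CC_var5 column are invented purely for illustration.

```r
# Hypothetical three-source data; the CC_var5 column is invented for illustration
df3 <- data.frame(
  AA_var1 = c(1, NA, 3),
  BB_var3 = c(2, NA, NA),
  CC_var5 = c(NA, 4, 5)
)

var_cols <- grep("_var", names(df3), value = TRUE)
prefixes <- sub("_var.*", "", var_cols)

# present[i, p] is TRUE when row i has at least one non-NA value from source p
present <- sapply(
  split.default(df3[var_cols], prefixes),
  function(x) rowSums(!is.na(x)) > 0
)

all_sources <- rowSums(present) == ncol(present)  # rule used by all(V1 != 0)
two_or_more <- rowSums(present) > 1               # rule stated in the question

means <- rowMeans(df3[var_cols], na.rm = TRUE)
res_all <- ifelse(all_sources, means, NA)   # NA for every row here
res_two <- ifelse(two_or_more, means, NA)   # 1.5, NA, 4
```

Which rule is right depends on the data; under the question's wording, the two_or_more version is the closer match.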

# relies on column positions: 1-2 are AA, 3 is myid, 4-5 are BB
fun <- function(x) {
    MEAN <- mean(c(x[1], x[2], x[4], x[5]), na.rm=TRUE)
    CHECK <- sum(!is.na(c(x[1], x[2]))) > 0 & sum(!is.na(c(x[4], x[5]))) > 0
    MEAN * ifelse(CHECK, 1, NaN)
}
df$rowMeanIfDiverseData <- apply(df, 1, fun)
df

  AA_var1 AA_var2   myid BB_var3 BB_var4 rowMeanIfDiverseData
1      NA      NA 123456      10      12                  NaN
2      NA      10 194200      12      NA                   11
3      12      10 132200      NA      NA                  NaN
4      12      NA 132201      NA      12                   12
5      NA      NA 132202      NA      NA                  NaN
6      12      13 132203      14      NA                   13
