R: subsetting with respect to a vector

I am trying to figure out how to apply a function only to the rows of a matrix that share the same entry in the last column, but have had no luck so far.

My matrix (call it simply matrix, and suppose it is 5x4) looks like this:

d1.1   d1.2   d1.3   NAME1 
d2.1   d2.2   d2.3   NAME1 
d3.1   d3.2   d3.3   NAME2 
d4.1   d4.2   d4.3   NAME3
d5.1   d5.2   d5.3   NAME2

I want to compute the summary statistic fun1 over the rows with the same name, to get a final matrix that looks like this:

fun1(d1.1, d2.1)   fun1(d1.2, d2.2)   fun1(d1.3, d2.3)   NAME1
fun1(d3.1, d5.1)   fun1(d3.2, d5.2)   fun1(d3.3, d5.3)   NAME2
d4.1               d4.2               d4.3               NAME3

It is also fine if fun1 is applied to 'single' rows as well, i.e.

fun1(d1.1, d2.1)   fun1(d1.2, d2.2)   fun1(d1.3, d2.3)   NAME1
fun1(d3.1, d5.1)   fun1(d3.2, d5.2)   fun1(d3.3, d5.3)   NAME2
fun1(d4.1)         fun1(d4.2)         fun1(d4.3)         NAME3

I tried

sapply(subset(matrix[,1:3], as.character(matrix[,4])==as.character(listofnames)), fun1)

but of course it does not work. The immediate problem is the subsetting condition as.character(matrix[,4])==as.character(listofnames), since the two objects have different lengths, but I am sure this is not the only one.
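The comparison above mixes a vector with one entry per row and a vector with one entry per name, so R recycles the shorter one instead of matching rows to names. A minimal base R sketch of the intended grouping follows. The data frame dat, its values, and the column names V1 to V3 are hypothetical stand-ins for the matrix in the question, and fun1 is set to min purely for illustration.

```r
# Hypothetical stand-in for the 5x4 matrix in the question, stored as a
# data frame so the numeric columns stay numeric next to the name column.
dat <- data.frame(
  V1   = c(5, 3, 8, 4, 6),
  V2   = c(1, 9, 2, 7, 0),
  V3   = c(4, 4, 6, 1, 3),
  name = c("NAME1", "NAME1", "NAME2", "NAME3", "NAME2")
)

fun1 <- min  # placeholder for the summary statistic

# Comparing against ONE name at a time gives a logical vector
# with exactly one entry per row, which is valid for subsetting.
rows_name1 <- dat$name == "NAME1"
name1_stats <- sapply(dat[rows_name1, 1:3], fun1)

# aggregate() repeats that step for every distinct name and
# returns one row per group.
res <- aggregate(dat[, 1:3], by = list(name = dat$name), FUN = fun1)
print(res)
```

Groups with a single row (NAME3 here) simply pass through fun1 unchanged when fun1 is min, which matches the second acceptable output in the question.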

I looked for similar problems but only found subsetting by specified numerical conditions (>3) or by patterns (every group of 7 ordered entries). No luck with factors or characters.

I guess there may be something helpful in the plyr package, but I am not able to make it work. Any suggestion is greatly appreciated!

Update

In my case, fun1=min. The problem has changed in the meantime: while keeping the data grouped by NAME, I would like to find the min of, say, column 1 in each group and keep the whole row where that min occurs. For example, suppose d1.1 < d2.1 and d5.1 < d3.1; then the matrix

d1.1   d1.2   d1.3   NAME1 
d2.1   d2.2   d2.3   NAME1 
d3.1   d3.2   d3.3   NAME2 
d4.1   d4.2   d4.3   NAME3
d5.1   d5.2   d5.3   NAME2

should become

d1.1   d1.2   d1.3   NAME1 
d4.1   d4.2   d4.3   NAME3
d5.1   d5.2   d5.3   NAME2

without loss of the other columns. I tried playing around with mutate and summarise as suggested, but I keep getting warnings and errors (and I do not find the help() pages very helpful at all).

You could try:

library(dplyr)
dfSelectSummary <- df %>%
    group_by(name) %>%
    summarise_each(funs(mean=mean(., na.rm=TRUE), sd=sd(., na.rm=TRUE),
                        median=stats::median(., na.rm=TRUE)), starts_with("X"))

dfSelectSummary[,1:4]
#Source: local data frame [3 x 4]

#   name X1_mean  X2_mean  X3_mean
#1 NAME1   4.250 3.333333 4.888889
#2 NAME2   5.375 4.555556 6.000000
#3 NAME3   6.000 8.000000 9.000000
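In newer dplyr releases summarise_each() and funs() are deprecated in favour of across(). The following is a sketch of the same grouped summary in that style, assuming dplyr 1.0.0 or later is installed. It rebuilds the example data from the data section so that it runs on its own, and the name dfSummary is chosen here only for illustration. Sampled values, and therefore the numbers, can differ between R versions.

```r
library(dplyr)

# Same example data as in the "data" section of this answer
set.seed(45)
df <- data.frame(name = sample(paste0("NAME", 1:3), 20, replace = TRUE),
                 matrix(sample(c(NA, 0:10), 20 * 5, replace = TRUE), ncol = 5))

# One row per name; each X column gets a mean, sd and median,
# with missing values dropped before every computation.
dfSummary <- df %>%
  group_by(name) %>%
  summarise(across(starts_with("X"),
                   list(mean   = ~ mean(.x, na.rm = TRUE),
                        sd     = ~ sd(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE))))

print(dfSummary[, 1:4])
```

The resulting columns are named X1_mean, X1_sd, X1_median and so on, grouped by input column rather than by statistic, so the column order differs from the summarise_each() output even though the names match.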

Or you could use data.table:

library(data.table)
DT <- data.table(df, key='name')
nm1 <- colnames(DT[, as.list(summary(X1[!is.na(X1)])), by=name])[-1]
DTSummary <- DT[,  c(Var=list(nm1),
    lapply(.SD, function(x) summary(x[!is.na(x)]))), by=name]

head(DTSummary,8)
#    name     Var    X1    X2     X3    X4    X5
#1: NAME1    Min.  1.00 0.000  0.000 3.000  0.00
#2: NAME1 1st Qu.  2.00 2.000  1.000 3.750  3.25
#3: NAME1  Median  3.50 3.000  6.000 7.500  5.00
#4: NAME1    Mean  4.25 3.333  4.889 6.375  5.00
#5: NAME1 3rd Qu.  6.00 5.000  8.000 8.250  7.25
#6: NAME1    Max. 10.00 7.000 10.000 9.000 10.00
#7: NAME2    Min.  0.00 0.000  0.000 1.000  1.00
#8: NAME2 1st Qu.  3.75 4.000  4.000 3.000  4.25

Another option would be to try summaryBy from doBy:

library(doBy)
 summaryBy(.~name, data=df,
    FUN=function(x) c(mean=mean(x, na.rm=TRUE), var= var(x, na.rm=TRUE),
                    median=median(x, na.rm=TRUE)))

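If the name column is numeric, you can work on the matrix directly without converting it to a data frame:

 # factor() first, so the numeric codes are well defined even when name is stored as character
 m1 <- as.matrix(cbind(name=as.numeric(factor(df$name)), df[,-1]))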
 by(m1[,-1], m1[,1], FUN=summary)

data

set.seed(45)
df <- data.frame(name=sample(paste0("NAME", 1:3),20, replace=TRUE),
        matrix(sample(c(NA, 0:10), 20*5, replace=TRUE), ncol=5))

Update

If you need the results in long form and would like to keep the Comments column, you could use mutate_each:

 df1 <- df %>% 
           group_by(name) %>% 
           mutate_each(funs(min=min(., na.rm=TRUE)), starts_with("X"))

 colnames(df1)[2:6] <- paste0("Min", colnames(df1)[2:6])
 head(df1,3)
 #Source: local data frame [3 x 7]
 #Groups: name

 #   name MinX1 MinX2 MinX3 MinX4 MinX5 Comments
 #1 NAME2     0     0     0     1     1     Fair
 #2 NAME1     1     0     0     3     0      Bad
 #3 NAME1     1     0     0     3     0     Good

newdata

  set.seed(45)
  df <- data.frame(name=sample(paste0("NAME", 1:3),20, replace=TRUE),
          matrix(sample(c(NA, 0:10), 20*5, replace=TRUE), ncol=5), 
             Comments=sample(c("Good", "Fair", "Bad", "ugly"), 20, replace=TRUE))

I think I made it!

library(dplyr)

df1 <- df %>%
       group_by(NAMES) %>%
       # filter() already receives the piped data, so df must not be passed again
       filter(X1 == min(X1))

The minimum is returned and no columns are dropped. I found a similar answer in another thread. One caveat is that it returns all tied rows if a group has multiple minima, but that does not happen in my case.
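If ties could occur, one possible way to keep exactly one row per group is to sort and then drop repeated names. This is a base R sketch on made-up data: dat, its values, and the columns X1, X2 and name are hypothetical and chosen so that NAME1 has a tied minimum.

```r
# Hypothetical data in which NAME1 has two rows with the same minimum X1
dat <- data.frame(
  X1   = c(2, 5, 7, 4, 1, 2),
  X2   = c(10, 20, 30, 40, 50, 60),
  name = c("NAME1", "NAME1", "NAME2", "NAME3", "NAME2", "NAME1")
)

# Sort by group, then by X1, so each group's minimum comes first.
# order() leaves remaining ties in their original order and puts NA last.
ord <- dat[order(dat$name, dat$X1), ]

# Keep only the first row of each group; all other columns are preserved.
first_min <- ord[!duplicated(ord$name), ]
print(first_min)
```

With dplyr 1.0.0 or later, slice_min(X1, n = 1, with_ties = FALSE) after group_by() is another option that should give one row per group.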
