简体   繁体   English

按行列出data.frame并按行在每个部分上应用函数

[英]Subset a data.frame by list and apply function on each part, by rows

This may seem as a typical plyr problem, but I have something different in mind. 这可能看起来像一个典型的plyr问题,但我有一些不同的想法。 Here's the function that I want to optimize (skip the for loop). 这是我想要优化的功能(跳过for循环)。

# dummy data
set.seed(1985)
lst <- list(a=1:10, b=11:15, c=16:20)
m <- matrix(round(runif(200, 1, 7)), 10)
m <- as.data.frame(m)


dfsub <- function(dt, lst, fun) {
    # check whether dt is `data.frame`
    stopifnot (is.data.frame(dt))
    # check if vectors in lst are "whole" / integer
    # vector elements should be column indexes
    is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol
    # fall if any non-integers in list
    idx <- rapply(lst, is.wholenumber)
    stopifnot(idx)
    # check for list length
    stopifnot(ncol(dt) == length(idx))
    # subset the data
    subs <- list()
    for (i in 1:length(lst)) {
            # apply function on each part, by row
            subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
    }
    # preserve names
    names(subs) <- names(lst)
    # convert to data.frame
    subs <- as.data.frame(subs)
    # guess what =)
    return(subs)
}

And now a short demonstration... actually, I'm about to explain what I primarily intended to do. 现在是一个简短的演示......实际上,我将要解释我的主要目的。 I wanted to subset a data.frame by vectors gathered in list object. 我想通过list对象中收集的向量对data.frame进行子集化。 Since this is a part of code from a function that accompanies data manipulation in psychological research, you can consider m as a results from personality questionnaire (10 subjects, 20 vars). 由于这是伴随心理研究中数据操作的函数的代码的一部分,因此您可以将m视为人格问卷(10个科目,20个变量)的结果。 Vectors in list hold column indexes that define questionnaire subscales (eg personality traits). 列表中的向量包含定义问卷子量表(例如人格特征)的列索引。 Each subscale is defined by several items (columns in data.frame ). 每个子量表由几个项目( data.frame列)定义。 If we presuppose that the score on each subscale is nothing more than sum (or some other function) of row values (results on that part of questionnaire for each subject), you could run: 如果我们预先假定每个子量表上的得分只不过是行值的sum (或其他一些函数)(每个主题的问卷调查表的结果),您可以运行:

> dfsub(m, lst, sum)
    a  b  c
1  46 20 24
2  41 24 21
3  41 13 12
4  37 14 18
5  57 18 25
6  27 18 18
7  28 17 20
8  31 18 23
9  38 14 15
10 41 14 22

I took a glance at this function and I must admit that this little loop isn't spoiling the code at all... BUT, if there's an easier/efficient way of doing this, please, let me know! 我瞥了一眼这个函数,我必须承认这个小循环并没有破坏代码...但是,如果有一个更简单/有效的方法,请告诉我!

I'd take a different approach and keep everything as data frames so that you can use merge and ddply. 我采取不同的方法,并将所有内容保存为数据框,以便您可以使用merge和ddply。 I think you'll find this approach is a little more general, and it's easier to check that each step is performed correctly. 我认为你会发现这种方法更为通用,并且更容易检查每个步骤是否正确执行。

# Convert everything to long data frames
m$id <- 1:nrow(m)

library(reshape)
obs <- melt(m, id = "id")
obs$variable <- as.numeric(gsub("V", "", obs$variable))

varinfo <- melt(lst)
names(varinfo) <- c("variable", "scale")

# Merge and summarise
obs <- merge(obs, varinfo, by = "variable")

ddply(obs, c("id", "scale"), summarise, 
  mean = mean(value), 
  sum = sum(value))

after loading the plyr package, replace 装入plyr包装后,更换

subs <- list()
    for (i in 1:length(lst)) {
            # apply function on each part, by row
            subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
    }

with

subs <- llply(lst,function(x) apply(dt[,x],1,fun))

For your specific example, a one-line solution is sapply(lst,function(x) rowSums(m[,x])) (although you might add some more lines to check for valid input and put in the column names). 对于您的具体示例,单行解决方案是sapply(lst,function(x) rowSums(m[,x])) (尽管您可能会添加更多行来检查有效输入并放入列名称)。

Do you have other, more general, applications in mind? 您是否考虑过其他更通用的应用程序? Or is this possibly a case of YAGNI ? 或者这可能是YAGNI的情况?

@Hadley, I've checked your response since it's quite straightforward and easy for bookkeeping (besides the fact it's more general-purpose-solution). @哈德利,我已经检查了你的回答,因为它非常简单易记(除了它是更通用的解决方案)。 However, here's my not-so-long script that does the thing and requires only base package (which is trivial since I install plyr and reshape just after installing R). 但是,这是我不太长的脚本,只需要base包(这是非常简单的,因为我安装plyrreshape安装R后reshape )。 Now, here's the source: 现在,这是来源:

dfsub <- function(dt, lst, fun) {
        # check whether dt is `data.frame`
        stopifnot (is.data.frame(dt))
        # convert data.frame factors to numeric
        dt <- as.data.frame(lapply(dt, as.numeric))
        # check if vectors in lst are "whole" / integer
        # vector elements should be column indexes
        is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol
        # fall if any non-integers in list
        idx <- rapply(lst, is.wholenumber)
        stopifnot(idx)
        # check for list length
        stopifnot(ncol(dt) == length(idx))
        # subset the data
        subs <- list()
        for (i in 1:length(lst)) {
                # apply function on each part, by row
                subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
        }
        names(subs) <- names(lst)
        # convert to data.frame
        subs <- as.data.frame(subs)
        # guess what =)
        return(subs)
}

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM