
Explain R tapply description

I understand what tapply() does in R. However, I cannot parse this description of it from the documentation:


Apply a Function Over a "Ragged" Array

Description:

     Apply a function to each cell of a ragged array, that is to each
     (non-empty) group of values given by a unique combination of the
     levels of certain factors.

Usage:

     tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

When I think of tapply, I think of GROUP BY in SQL. You group the values in X by the parallel factor levels in INDEX and apply FUN to each group. I have read the description of tapply 100 times and still can't figure out how what it says maps to how I understand tapply. Perhaps someone can help me parse it?
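To make that mental model concrete, here is a minimal sketch of the GROUP BY reading. The `salary` and `dept` vectors are made-up data, not anything from the documentation:

```r
# Hypothetical data: one salary per employee and a parallel department factor
salary <- c(50, 60, 55, 70, 80)
dept   <- factor(c("ops", "ops", "dev", "dev", "dev"))

# Roughly: SELECT dept, AVG(salary) FROM employees GROUP BY dept
avg_by_dept <- tapply(salary, dept, mean)
print(avg_by_dept)
```

Each element of the result is named after one level of `dept` and holds the mean of the salaries that share that level.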

Let's see what the R documentation says on the subject:

The combination of a vector and a labelling factor is an example of what is sometimes called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently, as we see in the next section.

The list of factors you supply via INDEX together specifies a collection of subsets of X, of possibly different lengths (hence the 'ragged' descriptor). FUN is then applied to each subset.
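A small sketch of this, using invented vectors `x`, `grp` and `yr` (none of them come from the question), shows subsets of different lengths and what happens to a combination of levels that has no values at all:

```r
# Hypothetical data: six measurements with two parallel grouping factors
x   <- c(10, 20, 30, 40, 50, 60)
grp <- factor(c("a", "a", "a", "b", "b", "b"))
yr  <- factor(c("y1", "y1", "y2", "y1", "y1", "y1"))

# Number of values in each grp/yr cell: the subsets differ in size,
# and the b/y2 cell is empty
cell_sizes <- table(grp, yr)
print(cell_sizes)

# FUN is applied to each non-empty cell; the empty cell comes back as NA
cell_means <- tapply(x, list(grp, yr), mean)
print(cell_means)
```

The NA in the b/y2 position is the "(non-empty)" qualifier from the description at work: FUN is never called for that cell.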

EDIT: @Joris makes an excellent point in the comments. It may be helpful to think of tapply(X, Y, ...) as a wrapper for sapply(split(X, Y), ...): if Y is a list of grouping factors, it builds a new, single grouping factor based on their unique levels, splits X accordingly, and applies FUN to each piece.
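As a rough check of that analogy (a sketch with made-up vectors `x` and `g`, not a claim about how tapply is implemented internally), the two calls give the same values for a single grouping vector:

```r
# Hypothetical data: five values and a parallel grouping vector
x <- c(2, 4, 6, 8, 10)
g <- c("u", "v", "u", "v", "v")

via_tapply <- tapply(x, g, sum)         # 1-d array with names u, v
via_split  <- sapply(split(x, g), sum)  # named numeric vector

print(via_tapply)
print(via_split)
```

The return types differ slightly (an array versus a plain named vector), which is one reason to treat this as an analogy rather than an exact equivalence.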

EDIT: Here's an illustrative example:

library(lattice)   # provides the barley data set
library(plyr)      # provides ddply()
library(reshape2)  # provides melt(); assumed here, plyr does not export it
set.seed(123)

# Make this example unbalanced
dat <- barley[sample(1:120, 50), ]

# Suppose we want the average yield by year/site:
table(dat$year, dat$site)

# That's what they mean by a 'ragged' array: there are different
# numbers of observations at each combination of levels

# In plyr we could use ddply:
ddply(dat, .(year, site), .fun = function(x) mean(x$yield))

# Which gives the same result (listed in a different order) as:
melt(tapply(dat$yield, list(dat$year, dat$site), mean))

@joran's great answer helped me understand it (so please vote for his; I would have added this as a comment if it weren't too long for that), but this may be of help to some:

In quite a few languages, you have two-dimensional arrays. Depending on the language, these arrays have fixed dimensions (i.e. each row has the same number of columns), or the language allows the number of items per row to differ. So instead of:

A: 1  2  3
B: 4  5  6
C: 7  8  9

You could get something like:

A: 1  3
B: 4  5  6
C: 8

This is called a ragged array because, well, the right side of it looks ragged. In typical R style, we might represent this as two vectors:

values<-c(1,3,4,5,6,8)
names<-c("A", "A", "B", "B", "B", "C")

So calling tapply with these two vectors as its first two arguments lets us apply a function to each 'row' of our ragged array.
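For instance, a short sketch using the two vectors above (the choice of `length` and `sum` as the applied functions is just for illustration):

```r
values <- c(1, 3, 4, 5, 6, 8)
names  <- c("A", "A", "B", "B", "B", "C")

# How many items each 'row' holds
row_lengths <- tapply(values, names, length)
print(row_lengths)

# The sum of each 'row'
row_sums <- tapply(values, names, sum)
print(row_sums)
```

Rows A, B and C have 2, 3 and 1 items respectively, which is exactly the raggedness shown in the layout above.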
