Subsetting by two factors, all levels, with simple code
I'm aware that this question is simple, but I couldn't find a solution that avoids creating intermediate objects. I'd like a one-liner, or code that is as simple as possible.
Suppose I have a data frame called df with columns x, y, z:
x<-c(rep('place1',33),rep('place2',33),rep('place3',34))
y<-sample(c('type1','type2','type3','type4','type5'),100,replace=T)
z<-sample(40:80,100,replace=T)
df<-data.frame(x,y,z)
I would like to get all possible subsets of z for each combination of levels of x and y (type1 in place1, type2 in place1, type3 in place1 ... type4 in place3 and type5 in place3). Something like this:
[[place1]]
[type1]
[1] 57 73 74 47 52 61
[type2]
[1] 72 76 64 62 73 75
...
[type5]
...
[[place3]]
[type1]
...
[type5]
If this is possible, how could I access each subset?
I've tried a nested split inside an lapply, without success.
Sorry for this simple question, but I couldn't find a suitable solution.
Any help would be appreciated.
Here is one way. Split df by the variable x, then split each resulting data frame again by the variable y. This gives you one subset per combination of x and y. A trimmed sample of the output is shown after the code.
lapply(split(df, f = df$x), function(x) split(x, f = x$y))
#$place1
#$place1$type1
# x y z
#5 place1 type1 46
#7 place1 type1 41
#$place1$type2
# x y z
#3 place1 type2 44
#4 place1 type2 59
If you just want the values of z, you can do something like this:
lapply(split(df, f = df$x), function(x) split(x$z, f = x$y))
#$place1
#$place1$type1
#[1] 46 41 50 59 54 51 66 70
#$place1$type2
#[1] 44 59 60 53 74 46 67 70
#$place1$type3
#[1] 63 70 80 44 73 74 58
#$place1$type4
#[1] 45 67 52 72 45 48 79 65
#$place1$type5
#[1] 75 54
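On the question of how to access each subset: the result is a list of lists, so each subset can be reached by name. Here is a minimal sketch, assuming a small toy data frame with the same column names as the question; the object name nested and the toy values are only for illustration.

```r
# Toy stand-in for the question's df (hypothetical values, same column names)
df <- data.frame(
  x = rep(c("place1", "place2"), each = 4),
  y = rep(c("type1", "type2"), times = 4),
  z = c(57, 73, 74, 47, 52, 61, 72, 76)
)

# List of lists: first level is x, second level is y, leaves are z values
nested <- lapply(split(df, f = df$x), function(d) split(d$z, f = d$y))

# Access one subset by name, either with $ or with [[ ]]
nested$place1$type2
nested[["place2"]][["type1"]]

# Names available at each level
names(nested)
names(nested$place1)
```

The [[ ]] form is handy when the place or type name is stored in a variable.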
EDIT
Following the link provided by @user295691, you could also do the following.
split(df$z, interaction(df$x,df$y))
If you want each vector of z values as a separate object in the global environment, you could do:
list2env(split(df$z, interaction(df$x,df$y)), .GlobalEnv)
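Writing into .GlobalEnv creates one object per combination and can overwrite existing objects that happen to have the same names. As a possible alternative, here is a sketch that puts the vectors into a separate environment instead; the toy data frame and the names groups and env are assumptions for illustration.

```r
# Toy stand-in for the question's df (hypothetical values, same column names)
df <- data.frame(
  x = rep(c("place1", "place2"), each = 4),
  y = rep(c("type1", "type2"), times = 4),
  z = c(57, 73, 74, 47, 52, 61, 72, 76)
)

# One vector of z values per x.y combination
groups <- split(df$z, interaction(df$x, df$y))

# Copy the list elements into a fresh environment rather than .GlobalEnv
env <- list2env(groups, envir = new.env())

# List the objects and retrieve individual vectors
ls(env)
get("place1.type2", envir = env)
env$place2.type1
```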
EDIT2
The OP wanted to run stats using this data, so I thought it would be a good idea to leave the following. If you need to create a data frame from a list of vectors with different lengths, you could do something like this. listvectors2df lets you create such a data frame, padding the shorter vectors with NA.
ana <- split(df$z, interaction(df$x,df$y))
# I used a good answer in this post and wrote the following.
#http://stackoverflow.com/questions/15201305/how-to-convert-a-list-consisting-of-vector-of-different-lengths-to-a-usable-data
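listvectors2df <- function(l){
# Length of each vector in the list
n.obs <- sapply(l, length)
# Indices 1..(longest length); indexing past the end of a shorter vector gives NA
seq.max <- seq_len(max(n.obs))
# One column per list element, padded with NA, returned explicitly
data.frame(sapply(l, "[", i = seq.max), stringsAsFactors = FALSE)
}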
bob <- listvectors2df(ana)
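If the goal is only per-group statistics, padding to a data frame may not be necessary, since functions can be applied to the list directly. A minimal sketch, assuming a toy data frame with the question's column names; group_means and group_stats are hypothetical names.

```r
# Toy stand-in for the question's df (hypothetical values, same column names)
df <- data.frame(
  x = rep(c("place1", "place2"), each = 4),
  y = rep(c("type1", "type2"), times = 4),
  z = c(57, 73, 74, 47, 52, 61, 72, 76)
)

ana <- split(df$z, interaction(df$x, df$y))

# One statistic per group, returned as a named numeric vector
group_means <- sapply(ana, mean)
group_means

# Several statistics per group: sapply gives one column per group,
# so transpose to get one row per group
group_stats <- t(sapply(ana, function(v) c(n = length(v), mean = mean(v), max = max(v))))
group_stats
```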
You can also use split with interaction:
split(df, interaction(x,y))
$place1.type1
x y z
6 place1 type1 57
25 place1 type1 55
27 place1 type1 55
28 place1 type1 75
29 place1 type1 54
$place2.type1
x y z
36 place2 type1 70
42 place2 type1 69
45 place2 type1 78
57 place2 type1 79
59 place2 type1 46
60 place2 type1 45
63 place2 type1 73
64 place2 type1 79
$place3.type1
x y z
85 place3 type1 54
To access each element:
> ll = split(df, interaction(x,y))
>
> ll[[1]]
x y z
6 place1 type1 57
25 place1 type1 55
27 place1 type1 55
28 place1 type1 75
29 place1 type1 54
>
> ll[[2]]
x y z
36 place2 type1 70
42 place2 type1 69
45 place2 type1 78
57 place2 type1 79
59 place2 type1 46
60 place2 type1 45
63 place2 type1 73
64 place2 type1 79
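Positional indices such as ll[[1]] depend on the order of the factor levels, so access by name may be more robust. The sketch below, on a toy data frame (hypothetical values), also shows the drop argument of split, which leaves out combinations that have no rows; ll_nonempty is a name introduced here for illustration.

```r
# Toy data in which the combination place2/type2 never occurs
df <- data.frame(
  x = c("place1", "place1", "place1", "place2", "place2"),
  y = c("type1", "type2", "type1", "type1", "type1"),
  z = c(57, 73, 74, 52, 61)
)

ll <- split(df, interaction(df$x, df$y))

# Names follow the pattern <x level>.<y level>; empty combinations are kept
names(ll)

# Access by name instead of by position
ll[["place1.type1"]]
ll$place1.type2$z

# drop = TRUE removes combinations without any rows
ll_nonempty <- split(df, interaction(df$x, df$y), drop = TRUE)
names(ll_nonempty)
```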
data.table can also be used:
library(data.table)
dtt = data.table(df)
dtt[order(x,y),list(meanz=mean(z), maxz=max(z), sumz=sum(z)),by=list(x,y)]
x y meanz maxz sumz
1: place1 type1 63.11111 80 568
2: place1 type2 68.12500 79 545
3: place1 type3 58.80000 76 294
4: place1 type4 59.83333 79 359
5: place1 type5 59.40000 80 297
6: place2 type1 55.85714 69 391
7: place2 type2 59.71429 71 418
8: place2 type3 61.00000 76 305
9: place2 type4 53.63636 71 590
10: place2 type5 44.66667 46 134
11: place3 type1 62.16667 74 373
12: place3 type2 63.42857 80 444
13: place3 type3 64.00000 77 384
14: place3 type4 61.28571 80 429
15: place3 type5 51.00000 60 408
There are a couple of solutions. The first is the lapply/split approach that jazzurro has provided. You could also combine the factors into a single factor, e.g.
> split(df, paste(df$x, df$y))
$`place1 type1`
x y z
3 place1 type1 57
24 place1 type1 54
$`place1 type2`
x y z
1 place1 type2 67
6 place1 type2 75
7 place1 type2 72
12 place1 type2 57
...
The other solution would be to use a library that has intrinsic support for multi-level grouping, like data.table or plyr / dplyr. In dplyr (once the package is loaded), the operation would look like this, including a summary, in this case the mean and max of the third column:
> df %>% group_by(x, y) %>% summarise(mean(z), max(z))
Source: local data frame [15 x 4]
Groups: x
x y mean(z) max(z)
1 place1 type1 55.50000 57
2 place1 type2 65.50000 80
3 place1 type3 60.40000 78
4 place1 type4 57.12500 73
...