Subsetting by two factors, all levels, with simple code

I'm aware that this question is simple, but I couldn't find a solution that avoids creating intermediate objects. I'd like a one-liner, or code that is as simple as possible.

Suppose I have a data frame called df with columns x, y, and z:

x <- c(rep('place1', 33), rep('place2', 33), rep('place3', 34))
y <- sample(c('type1', 'type2', 'type3', 'type4', 'type5'), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

I would like to get all possible subsets of z, one for each combination of the levels of x and y (type1 in place1, type2 in place1, type3 in place1 ... type4 in place3, and type5 in place3). Something like this:

[[place1]]
[type1]
[1] 57 73 74 47 52 61

[type2]
[1] 72 76 64 62 73 75
...

[type5]
...

[[place3]]
[type1]
...

[type5]

If this is possible, how could I access each subset?

I've tried nesting split inside lapply, without success.

Sorry for such a simple question, but I couldn't find a suitable solution.

Any help would be appreciated.

Here is one way. First, split df by the variable x. Then split each resulting data frame again by the variable y. This gives you a nested list with one subset per combination of x and y. I've left a trimmed sample of the output at the end.

lapply(split(df, f = df$x), function(x) split(x, f = x$y))

#$place1
#$place1$type1
#        x     y  z
#5  place1 type1 46
#7  place1 type1 41

#$place1$type2
#        x     y  z
#3  place1 type2 44
#4  place1 type2 59

If you just want the values of z, you can do something like this:

lapply(split(df, f = df$x), function(x) split(x$z, f = x$y))

#$place1
#$place1$type1
#[1] 46 41 50 59 54 51 66 70

#$place1$type2
#[1] 44 59 60 53 74 46 67 70

#$place1$type3
#[1] 63 70 80 44 73 74 58

#$place1$type4
#[1] 45 67 52 72 45 48 79 65

#$place1$type5
#[1] 75 54
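
To answer the "how could I access each subset" part: the result is a list of lists, so you can index it by name. This is only a minimal sketch; the name res and the set.seed call are my own additions for illustration and are not part of the question.

```r
set.seed(1)  # assumption: added only so the random sample is reproducible
x <- c(rep('place1', 33), rep('place2', 33), rep('place3', 34))
y <- sample(c('type1', 'type2', 'type3', 'type4', 'type5'), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

# res is a hypothetical name for the nested list built above
res <- lapply(split(df, f = df$x), function(d) split(d$z, f = d$y))

names(res)                  # the outer level is keyed by place
res$place1$type2            # z values for type2 in place1
res[["place3"]][["type5"]]  # same idea with [[ ]], handy when the names are stored in variables
```

The [[ ]] form is usually the safer choice inside loops or functions, since it accepts character variables rather than literal names.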

EDIT

Following the link provided by @user295691, you could also do the following.

split(df$z, interaction(df$x,df$y))
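
With this flat version, each element is named after its combination of levels, joined by a dot, so you can pick a subset by that name. A rough sketch, assuming the df from the question; by_combo and by_combo2 are hypothetical names and set.seed is only there for reproducibility.

```r
set.seed(1)  # assumption: added only so the random sample is reproducible
x <- c(rep('place1', 33), rep('place2', 33), rep('place3', 34))
y <- sample(c('type1', 'type2', 'type3', 'type4', 'type5'), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

# by_combo is a hypothetical name for the flat list of z vectors
by_combo <- split(df$z, interaction(df$x, df$y))

names(by_combo)             # one name per x/y combination, e.g. "place1.type1"
by_combo[["place2.type3"]]  # z values for type3 in place2
by_combo$place2.type3       # the same element via $

# sep changes how the two levels are joined in the names;
# drop = TRUE discards combinations that never occur in the data
by_combo2 <- split(df$z, interaction(df$x, df$y, sep = "_", drop = TRUE))
```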

If you want each vector of z values as a separate object in your global environment, you could do:

list2env(split(df$z, interaction(df$x,df$y)), .GlobalEnv)

EDIT2

The OP wanted to run stats using this data, so I thought it would be a good idea to leave the following. If you need to create a data frame from a list of vectors of different lengths, you could do something like this. listvectors2df lets you create such a data frame, padding the shorter vectors with NA.

ana <- split(df$z, interaction(df$x,df$y))

# I used a good answer in this post and wrote the following.
#http://stackoverflow.com/questions/15201305/how-to-convert-a-list-consisting-of-vector-of-different-lengths-to-a-usable-data

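listvectors2df <- function(l){

    # length of each vector in the list
    n.obs <- sapply(l, length)
    # indexing past the end of a shorter vector yields NA, which pads it
    seq.max <- seq_len(max(n.obs))
    # return the data frame visibly (the original ended on an assignment)
    data.frame(sapply(l, "[", i = seq.max), stringsAsFactors = FALSE)

}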

bob <- listvectors2df(ana)
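
To see what the padding looks like, here is a small sketch on a toy list. The names toy and padded are made up for illustration, and lengths() stands in for the sapply(l, length) call above.

```r
# Same idea as the function above: index every vector up to the longest length
listvectors2df <- function(l) {
    seq.max <- seq_len(max(lengths(l)))
    data.frame(sapply(l, "[", i = seq.max), stringsAsFactors = FALSE)
}

toy <- list(a = c(1, 2, 3), b = c(4, 5))  # hypothetical input with unequal lengths
padded <- listvectors2df(toy)

padded                          # column b gets NA in its third row
colMeans(padded, na.rm = TRUE)  # stats need na.rm = TRUE to skip the padding
```

Keep in mind that any summary you run on the padded data frame has to ignore the NA values, otherwise the shorter groups will return NA.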

You can also use split with interaction:

split(df, interaction(x,y))
$place1.type1
        x     y  z
6  place1 type1 57
25 place1 type1 55
27 place1 type1 55
28 place1 type1 75
29 place1 type1 54

$place2.type1
        x     y  z
36 place2 type1 70
42 place2 type1 69
45 place2 type1 78
57 place2 type1 79
59 place2 type1 46
60 place2 type1 45
63 place2 type1 73
64 place2 type1 79

$place3.type1
        x     y  z
85 place3 type1 54

To access each element:

> ll = split(df, interaction(x,y))
> 
> ll[[1]]
        x     y  z
6  place1 type1 57
25 place1 type1 55
27 place1 type1 55
28 place1 type1 75
29 place1 type1 54
> 
> ll[[2]]
        x     y  z
36 place2 type1 70
42 place2 type1 69
45 place2 type1 78
57 place2 type1 79
59 place2 type1 46
60 place2 type1 45
63 place2 type1 73
64 place2 type1 79

data.table can also be used:

library(data.table)
dtt = data.table(df)

dtt[order(x,y),list(meanz=mean(z), maxz=max(z), sumz=sum(z)),by=list(x,y)]
         x     y    meanz maxz sumz
 1: place1 type1 63.11111   80  568
 2: place1 type2 68.12500   79  545
 3: place1 type3 58.80000   76  294
 4: place1 type4 59.83333   79  359
 5: place1 type5 59.40000   80  297
 6: place2 type1 55.85714   69  391
 7: place2 type2 59.71429   71  418
 8: place2 type3 61.00000   76  305
 9: place2 type4 53.63636   71  590
10: place2 type5 44.66667   46  134
11: place3 type1 62.16667   74  373
12: place3 type2 63.42857   80  444
13: place3 type3 64.00000   77  384
14: place3 type4 61.28571   80  429
15: place3 type5 51.00000   60  408
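
For comparison, if data.table is not available, a per-group summary can also be sketched in base R with tapply. This is a different technique from the data.table call above, shown under the assumption that one statistic at a time is enough; the name m and the set.seed call are my own additions.

```r
set.seed(1)  # assumption: added only so the random sample is reproducible
x <- c(rep('place1', 33), rep('place2', 33), rep('place3', 34))
y <- sample(c('type1', 'type2', 'type3', 'type4', 'type5'), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

# m is a hypothetical name: a place-by-type matrix of group means
m <- tapply(df$z, list(df$x, df$y), mean)

m                     # rows are places, columns are types
m["place1", "type2"]  # mean of z for type2 in place1
```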

There are a couple of solutions. The first is the lapply/split approach that jazzurro has provided. You could also combine the two factors into a single factor, e.g.

> split(df, paste(df$x, df$y))
$`place1 type1`
        x     y  z
3  place1 type1 57
24 place1 type1 54

$`place1 type2`
        x     y  z
1  place1 type2 67
6  place1 type2 75
7  place1 type2 72
12 place1 type2 57
...

The other solution would be to use a library that has intrinsic support for multi-level grouping, like data.table or plyr/dplyr. In dplyr, the operation would look like this (including a summary, in this case the mean and max of the third column):

> df %>% group_by(x, y) %>% summarise(mean(z), max(z))
Source: local data frame [15 x 4]
Groups: x

        x     y  mean(z) max(z)
1  place1 type1 55.50000     57
2  place1 type2 65.50000     80
3  place1 type3 60.40000     78
4  place1 type4 57.12500     73
...
