
Subsetting a data frame by columns and returning a list of those subsets

I want to take a data frame like this one:

df <- data.frame(
  SortCol1 = rep(c("One", "Two", "Three", "Four"), times = 5),
  SortCol2 = rep(c("A", "B"), times = 10),
  Arb1 = rep(c(1,0,1,1,0), times = 4),
  Arb2 = rep(c(0,1,1,0,0), times = 4)
)

   SortCol1 SortCol2 Arb1 Arb2
1       One        A    1    0
2       Two        B    0    1
3     Three        A    1    1
4      Four        B    1    0
5       One        A    0    0
6       Two        B    1    0
7     Three        A    0    1
8      Four        B    1    1
9       One        A    1    0
10      Two        B    0    0
11    Three        A    1    0
12     Four        B    0    1
13      One        A    1    1
14      Two        B    1    0
15    Three        A    0    0
16     Four        B    1    0
17      One        A    0    1
18      Two        B    1    1
19    Three        A    1    0
20     Four        B    0    0

Then subset it by SortCol1 and SortCol2 to return a list of all the subsetted data frames.

I have done something similar to this many times before, using ddply when I want to apply a function to the Arb1 and Arb2 columns.

e.g. I know that

library(plyr)  # provides ddply() and numcolwise()
ddply(df, c("SortCol1", "SortCol2"), numcolwise(sum))

will subset based on the two columns I want, and return a minimal data frame that has those columns and the sum function applied.

What I want is, rather than applying a function to those columns, to just have each subset returned as an element of a list.

Pretend the function that does that is called ddply_list. I would hope for something akin to

ddply_list(df, c("SortCol1", "SortCol2"))

which would return a list whose elements would be the data frames (which I have manually created for now):

df[df$SortCol1=="One" & df$SortCol2 == "A",]
   SortCol1 SortCol2 Arb1 Arb2
1       One        A    1    0
5       One        A    0    0
9       One        A    1    0
13      One        A    1    1
17      One        A    0    1

df[df$SortCol1=="Two" & df$SortCol2 == "B",]
   SortCol1 SortCol2 Arb1 Arb2
2       Two        B    0    1
6       Two        B    1    0
10      Two        B    0    0
14      Two        B    1    0
18      Two        B    1    1

etc. for all combinations of SortCol1 and SortCol2.

If there's a function that does this already, perfect! If not, any advice on how to get towards this solution would be awesome!

The main bit I'm not sure about is the simplest way to return all subsets of a data frame (subsetted by columns) as a list of data frames.

To put it another way, the ddply documentation describes the .fun argument as "function to apply to each piece". I think what I want is a way of just returning each 'piece' as an element of a list (preferably with the columns used for subsetting still attached).

Turns out it's very simple:

split(df, df[c("SortCol1", "SortCol2")], drop=TRUE)

Answer stolen from here: Automatically subset data frame by factor
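As a rough illustration of what that call returns, here is a minimal base R sketch that rebuilds the example data frame from the question and inspects the result. The variable name pieces is only a placeholder chosen for this example. Each list element is named after its combination of grouping values, joined with a dot, and keeps the grouping columns and the original row names.

```r
# Rebuild the example data frame from the question.
df <- data.frame(
  SortCol1 = rep(c("One", "Two", "Three", "Four"), times = 5),
  SortCol2 = rep(c("A", "B"), times = 10),
  Arb1 = rep(c(1, 0, 1, 1, 0), times = 4),
  Arb2 = rep(c(0, 1, 1, 0, 0), times = 4)
)

# Split into a named list of data frames, one per combination that occurs.
# "pieces" is a hypothetical name used only in this sketch.
pieces <- split(df, df[c("SortCol1", "SortCol2")], drop = TRUE)

# Element names are built from the grouping values, e.g. "One.A".
print(names(pieces))

# Pull out a single subset by name; the grouping columns are still attached.
print(pieces[["One.A"]])

# Any function can then be applied per piece, e.g. counting rows.
print(sapply(pieces, nrow))
```

Because the result is an ordinary list, lapply or sapply can stand in for the function argument that ddply would otherwise take.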

Usage:

split(x, f, drop = FALSE, ...)

Where x is a vector or data frame and f is a factor or list of factors defining the groups.
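The drop argument matters when f is a list of factors, because split then groups by every combination of their levels. The sketch below, again using the question's example data and placeholder variable names (groups, all_levels, used_only), compares the default drop = FALSE with drop = TRUE. In this data only four of the eight possible combinations actually occur, so the default leaves four zero-row data frames in the list.

```r
# Rebuild the example data frame from the question.
df <- data.frame(
  SortCol1 = rep(c("One", "Two", "Three", "Four"), times = 5),
  SortCol2 = rep(c("A", "B"), times = 10),
  Arb1 = rep(c(1, 0, 1, 1, 0), times = 4),
  Arb2 = rep(c(0, 1, 1, 0, 0), times = 4)
)

# The grouping columns, passed to split() as a list of factors.
groups <- df[c("SortCol1", "SortCol2")]

# Default (drop = FALSE): one element per possible level combination.
all_levels <- split(df, groups)

# drop = TRUE: only combinations that actually occur in the data.
used_only <- split(df, groups, drop = TRUE)

print(length(all_levels))
print(length(used_only))

# Number of empty (zero-row) data frames kept when drop = FALSE.
print(sum(sapply(all_levels, nrow) == 0))
```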
