Extract multiple data.frames from one with selection criteria

Let this be my data set:

df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000), 
             split = sample( c('SPLITMEHERE', 'OBS'), 1000, replace=TRUE, prob=c(0.04, 0.96) ))

So, I have some variables (in my case, 15) and a criterion by which I want to split the data.frame into multiple data.frames.

My criterion is the following: every other time 'SPLITMEHERE' appears, I want to take all the values below it (that is, all the 'OBS' rows) and build a data.frame from just those observations. So, if there are 20 'SPLITMEHERE's in the starting data.frame, I want to end up with 10 data.frames.

I know it sounds confusing and like it doesn't make much sense, but this is the result of extracting raw numbers from an awfully dirty .txt file to obtain meaningful data. Basically, every 'SPLITMEHERE' marks a new table in this .txt file, but each county is divided into two tables, so I want one table (data.frame) per county.

To make this clearer, here is an example of exactly what I need. Let's say the first 20 observations are:

             x1          x2           x3       split
1    0.307379064 0.400526799 0.2898194543         SPLITMEHERE
2    0.465236674 0.915204924 0.5168274657         OBS
3    0.063814420 0.110380201 0.9564822116         OBS
4    0.401881416 0.581895095 0.9443995396         OBS
5    0.495227871 0.054014926 0.9059893533         SPLITMEHERE
6    0.091463620 0.945452614 0.9677482590         OBS
7    0.876123151 0.702328031 0.9739113525         OBS
8    0.413120761 0.441159673 0.4725571219         OBS
9    0.117764512 0.390644966 0.3511555807         OBS
10   0.576699384 0.416279417 0.8961428872         OBS
11   0.854786077 0.164332814 0.1609375612         OBS
12   0.336853841 0.794020157 0.0647337821         SPLITMEHERE
13   0.122690541 0.700047133 0.9701538396         OBS
14   0.733926139 0.785366852 0.8938749305         OBS
15   0.520766503 0.616765349 0.5136788010         OBS
16   0.628549288 0.027319848 0.4509875809         OBS
17   0.944188977 0.913900539 0.3767973795         OBS
18   0.723421337 0.446724318 0.0925365961         OBS
19   0.758001243 0.530991725 0.3916394396         SPLITMEHERE
20   0.888036748 0.862066601 0.6501050976         OBS

What I would like to get is this:

data.frame1:

1    0.465236674 0.915204924 0.5168274657         OBS
2    0.063814420 0.110380201 0.9564822116         OBS
3    0.401881416 0.581895095 0.9443995396         OBS
4    0.091463620 0.945452614 0.9677482590         OBS
5    0.876123151 0.702328031 0.9739113525         OBS
6    0.413120761 0.441159673 0.4725571219         OBS
7    0.117764512 0.390644966 0.3511555807         OBS
8    0.576699384 0.416279417 0.8961428872         OBS
9    0.854786077 0.164332814 0.1609375612         OBS

And

data.frame2:

1    0.122690541 0.700047133 0.9701538396         OBS
2    0.733926139 0.785366852 0.8938749305         OBS
3    0.520766503 0.616765349 0.5136788010         OBS
4    0.628549288 0.027319848 0.4509875809         OBS
5    0.944188977 0.913900539 0.3767973795         OBS
6    0.723421337 0.446724318 0.0925365961         OBS
7    0.888036748 0.862066601 0.6501050976         OBS

Therefore, the split column only shows me where to split; the data in rows where 'SPLITMEHERE' is written is meaningless. But this is no bother, as I can delete these rows later. The point is separating multiple data.frames based on this criterion.

Obviously, the split() function and filter() from dplyr alone wouldn't suffice here. The real problem is that the lines which are supposed to separate the data.frames (i.e. every other 'SPLITMEHERE') do not appear at regular intervals, just as in my example above. Sometimes there is a gap of 3 lines, and other times it could be 10 or 15 lines.

Is there any way to extract this efficiently in R?

The hardest part of the problem is creating the groups. Once we have the proper groupings, it's easy enough to use split to get your result.

With that said, you can use cumsum to build the groups. Here I divide the cumsum by 2 and apply ceiling so that each pair of consecutive SPLITMEHERE markers is collapsed into one group. I also use ifelse to set aside the SPLITMEHERE rows themselves by assigning them to group 0:

df$group <- ifelse(df$split != "SPLITMEHERE", ceiling(cumsum(df$split=="SPLITMEHERE")/2), 0)
res <- split(df, df$group)
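
To see what the grouping expression computes, here is a small sketch that applies the same steps to the split column from the question's 20-row example. The intermediate names (split_col, is_marker, counter, pair_id) are only illustrative and are not part of the answer's code.

```r
# The split column from the question's 20-row example
split_col <- c("SPLITMEHERE", rep("OBS", 3),
               "SPLITMEHERE", rep("OBS", 6),
               "SPLITMEHERE", rep("OBS", 6),
               "SPLITMEHERE", "OBS")

is_marker <- split_col == "SPLITMEHERE"     # TRUE on the marker rows
counter   <- cumsum(is_marker)              # running count of markers seen so far
pair_id   <- ceiling(counter / 2)           # markers 1-2 -> 1, markers 3-4 -> 2
group     <- ifelse(is_marker, 0, pair_id)  # marker rows are sent to group 0

table(group)
# group
# 0 1 2
# 4 9 7
```

Groups 1 and 2 contain 9 and 7 rows, matching data.frame1 and data.frame2 in the question.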

The result is a list with one data frame per group. Group 0 holds the SPLITMEHERE rows, which you will want to throw out.
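
One possible way to discard group 0 is to filter before splitting, as in the sketch below. The seed and the names keep and tables are assumptions added for illustration. Note that any OBS rows appearing before the first SPLITMEHERE also get group 0 (the cumsum is still 0 there), so they are dropped by the same filter.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000),
                 split = sample(c("SPLITMEHERE", "OBS"), 1000, replace = TRUE,
                                prob = c(0.04, 0.96)))

# Same grouping as above: marker rows get 0, OBS rows get their pair number
df$group <- ifelse(df$split != "SPLITMEHERE",
                   ceiling(cumsum(df$split == "SPLITMEHERE") / 2), 0)

# Drop group 0 first, then split on the remaining groups
keep   <- df[df$group != 0, ]
tables <- split(keep[, c("x1", "x2", "x3", "split")], keep$group)

# Optionally renumber the rows within each data.frame, as in the desired output
tables <- lapply(tables, function(d) { rownames(d) <- NULL; d })

length(tables)  # number of data.frames, one per pair of markers
```

Filtering first keeps the helper column and the marker rows out of the final list, so each element of tables contains only OBS rows.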
