
Extract multiple data.frames from one with selection criteria

Let this be my data set:

df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000), 
             split = sample( c('SPLITMEHERE', 'OBS'), 1000, replace=TRUE, prob=c(0.04, 0.96) ))

I have several variables (15 in my real data) and a criterion by which I want to split the data.frame into multiple data.frames.

My criterion is the following: every second time 'SPLITMEHERE' appears, a new data.frame should start, and it should contain all the 'OBS' rows from that point until the next data.frame starts. So, if there are 20 'SPLITMEHERE' rows in the starting data.frame, I want to end up with 10 data.frames.

I know this sounds confusing and may not seem to make much sense, but it is the result of extracting raw numbers from an awfully dirty .txt file to obtain meaningful data. Basically, every 'SPLITMEHERE' marks the start of a new table in this .txt file, but each county is divided into two tables, and I want one table (data.frame) per county.

To make this clearer, here is an example of exactly what I need. Let's say the first 20 observations are:

             x1          x2           x3       split
1    0.307379064 0.400526799 0.2898194543         SPLITMEHERE
2    0.465236674 0.915204924 0.5168274657         OBS
3    0.063814420 0.110380201 0.9564822116         OBS
4    0.401881416 0.581895095 0.9443995396         OBS
5    0.495227871 0.054014926 0.9059893533         SPLITMEHERE
6    0.091463620 0.945452614 0.9677482590         OBS
7    0.876123151 0.702328031 0.9739113525         OBS
8    0.413120761 0.441159673 0.4725571219         OBS
9    0.117764512 0.390644966 0.3511555807         OBS
10   0.576699384 0.416279417 0.8961428872         OBS
11   0.854786077 0.164332814 0.1609375612         OBS
12   0.336853841 0.794020157 0.0647337821         SPLITMEHERE
13   0.122690541 0.700047133 0.9701538396         OBS
14   0.733926139 0.785366852 0.8938749305         OBS
15   0.520766503 0.616765349 0.5136788010         OBS
16   0.628549288 0.027319848 0.4509875809         OBS
17   0.944188977 0.913900539 0.3767973795         OBS
18   0.723421337 0.446724318 0.0925365961         OBS
19   0.758001243 0.530991725 0.3916394396         SPLITMEHERE
20   0.888036748 0.862066601 0.6501050976         OBS

What I would like to get is this:

data.frame1:

1    0.465236674 0.915204924 0.5168274657         OBS
2    0.063814420 0.110380201 0.9564822116         OBS
3    0.401881416 0.581895095 0.9443995396         OBS
4    0.091463620 0.945452614 0.9677482590         OBS
5    0.876123151 0.702328031 0.9739113525         OBS
6    0.413120761 0.441159673 0.4725571219         OBS
7    0.117764512 0.390644966 0.3511555807         OBS
8    0.576699384 0.416279417 0.8961428872         OBS
9    0.854786077 0.164332814 0.1609375612         OBS

And

data.frame2:
    1   0.122690541 0.700047133 0.9701538396         OBS
    2   0.733926139 0.785366852 0.8938749305         OBS
    3   0.520766503 0.616765349 0.5136788010         OBS
    4   0.628549288 0.027319848 0.4509875809         OBS
    5   0.944188977 0.913900539 0.3767973795         OBS
    6   0.723421337 0.446724318 0.0925365961         OBS
    7   0.888036748 0.862066601 0.6501050976         OBS

So the split column only shows me where to split; the data in the rows marked 'SPLITMEHERE' is meaningless. That is no bother, as I can delete those rows later. The point is to separate the data into multiple data.frames based on this criterion.

Obviously, split() on its own, or filter() from dplyr, wouldn't suffice here. The real problem is that the rows which are supposed to separate the data.frames (i.e. every other 'SPLITMEHERE') do not appear at regular intervals, just as in the example above. Sometimes the gap is 3 rows, and other times it could be 10 or 15 rows.

Is there any way to extract this efficiently in R?

The hardest part of the problem is creating the groups. Once we have the proper groupings, it's easy enough to use split() to get your result.

With that said, you can use cumsum() to build the groups. The cumulative sum of df$split == "SPLITMEHERE" counts how many markers have been seen so far. Here I divide that count by 2 and take the ceiling, so that each pair of consecutive SPLITMEHERE markers collapses into one group. I also use ifelse() to put the SPLITMEHERE rows themselves into a separate group 0:

df$group <- ifelse(df$split != "SPLITMEHERE", ceiling(cumsum(df$split=="SPLITMEHERE")/2), 0)
res <- split(df, df$group)
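
To see how the grouping works, here is a small trace on the marker column of the 20-row example from the question. This is only an illustrative sketch: the names split_col, is_marker, marker_count and pair_id are made up for the illustration and are not part of the answer's code.

```r
# Marker column of the 20-row example in the question
split_col <- c("SPLITMEHERE", rep("OBS", 3),
               "SPLITMEHERE", rep("OBS", 6),
               "SPLITMEHERE", rep("OBS", 6),
               "SPLITMEHERE", "OBS")

is_marker    <- split_col == "SPLITMEHERE"
marker_count <- cumsum(is_marker)         # number of markers seen so far
pair_id      <- ceiling(marker_count / 2) # markers 1 and 2 -> 1, markers 3 and 4 -> 2
group        <- ifelse(is_marker, 0, pair_id)  # marker rows go to group 0

table(group)
```

Group 1 ends up with the 9 OBS rows of data.frame1, group 2 with the 7 OBS rows of data.frame2, and group 0 with the 4 marker rows.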

The result is a list with one data frame per group. Group 0 holds the rows you want to throw out: the SPLITMEHERE rows themselves, plus any OBS rows that come before the first SPLITMEHERE.
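
One possible way to finish the cleanup is sketched below. It rebuilds the question's random data (the set.seed(1) call is an arbitrary addition so the run is reproducible), drops group 0, and removes the helper group column from each piece. Treat it as one option under those assumptions rather than the only way to do it.

```r
set.seed(1)  # arbitrary seed, added only for reproducibility
df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000),
                 split = sample(c("SPLITMEHERE", "OBS"), 1000, replace = TRUE,
                                prob = c(0.04, 0.96)))

# Same grouping as above: marker rows get 0, OBS rows get their pair number
is_marker <- df$split == "SPLITMEHERE"
df$group  <- ifelse(is_marker, 0, ceiling(cumsum(is_marker) / 2))

res <- split(df, df$group)
res[["0"]] <- NULL  # discard the marker rows (and any rows before the first marker)

# Drop the helper column and renumber the rows of each piece from 1
res <- lapply(res, function(d) {
  d$group <- NULL
  rownames(d) <- NULL
  d
})

length(res)      # number of data frames obtained
head(res[[1]])   # first few rows of the first one
```

Each element of res can then be accessed by position, e.g. res[[2]], or by its group name, e.g. res[["2"]].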
