
R - How can I subset rows of a dataframe using an index of integers?

I am trying to find a method for subsetting or slicing a dataframe based on each occurrence of a certain string in one column/variable, e.g. I would like to delete all rows between two occurrences of the string. This problem is similar to this question, BUT the crucial difference is that I have multiple occurrences of the string and would like to delete the rows between each pair of occurrences. I'm an R dunce and I can't find an elegant way to apply that solution to an index of more than two integers.

Say I have the following dataframe:

a <- c("one", "here is a string", "two", "three", "four", "another string", "five", "six", "yet another string", "seven", "last string")
b <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
c <- c("type1", "type1", "type1", "type1", "type1", "type1", "type2", "type2", "type2", "type2", "type2")

df <- data.frame(a,b,c)

Which gives the following:

print(df)

                 a b     c
1                 one a type1
2    here is a string b type1
3                 two c type1
4               three d type1
5                four e type1
6      another string f type1
7                five g type2
8                 six h type2
9  yet another string i type2
10              seven j type2
11        last string k type2

I would like to subset it so that all rows between, and including, each pair of occurrences of the string 'string' are removed:

     a b     c
1  one a type1
7 five g type2
8  six h type2

Using the solution accepted in the question I've linked to, I can remove the first set of rows by creating an index of row numbers and using the first two positions in the index:

index <- grep("string", df$a)

df[-(index[1]:index[2]), ]

But what I want to do would also include removing the rows between the next pair of integers in my index:

df[-(index[3]:index[4]), ]

My actual index has 128 integers (64 'pairs') so manually extracting the rows as I've done above will be a pain in the neck. My current plan if I can't find an elegant solution is to print the index and manually extract the rows (which, tbh, would probably have been faster than writing this question but would look awful and wouldn't teach me anything):

print(index)

[1]  2  6  9 11

df[-c(2:6, 9:11), ]

Is there a way to loop over each consecutive pair of integers in the index, or another way of doing what I'm trying to do? I'm not a hugely experienced R user and I have scoured SO for what I'm trying to do before creating this example (which I hope adheres to reprex standards; this is the first time I've asked a question).

I have included column 'c' in the reprex, because it reflects the structure of my actual data (one pair of 'string' occurrences in column 'a' for each change in observation for column 'c') and I'm wondering if there's a way to use group_by() with a base sub-setting expression?? But this could be a total red herring; just including it in case it helps.

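Create a sequence between each consecutive pair of index values using Map, and remove those rows. One way to pick out the pairs is to subset with alternating logical values, which R recycles along the vector: c(TRUE, FALSE) selects the opening positions and c(FALSE, TRUE) the closing ones.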

df[-unlist(Map(`:`, index[c(TRUE, FALSE)], index[c(FALSE, TRUE)])),]

#     a b     c
#1  one a type1
#7 five g type2
#8  six h type2
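
The one-liner above can be unpacked into separate steps to see what each piece contributes. This is only a sketch built on the question's example data: the names `starts`, `ends`, `rows_to_drop` and `result` are introduced here for illustration, and the `stopifnot()` check is an added assumption that the matches always come in complete opening/closing pairs.

```r
# Rebuild the example data from the question
df <- data.frame(
  a = c("one", "here is a string", "two", "three", "four", "another string",
        "five", "six", "yet another string", "seven", "last string"),
  b = letters[1:11],
  c = rep(c("type1", "type2"), times = c(6, 5))
)

index <- grep("string", df$a)          # row numbers of the matches

# Assumption (not in the original answer): matches form complete pairs
stopifnot(length(index) %% 2 == 0)

starts <- index[c(TRUE, FALSE)]        # 1st, 3rd, ... match: opening rows
ends   <- index[c(FALSE, TRUE)]        # 2nd, 4th, ... match: closing rows

# Map() calls `:` once per (start, end) pair; unlist() flattens the result
rows_to_drop <- unlist(Map(`:`, starts, ends))

result <- df[-rows_to_drop, ]
```

One caveat with negative indexing: if the string never occurs, `rows_to_drop` is empty and `df[-rows_to_drop, ]` returns zero rows instead of the whole data frame, so that case may need to be handled separately.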

Since I already posted it on twitter, here's a tidyverse-y solution:

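library(dplyr)

df %>%
  mutate(stringy = grepl("string", a),  # TRUE on rows containing the string
         seq = cumsum(stringy)) %>%     # running count of matches so far
  filter(seq %% 2 == 0, !stringy)       # keep rows outside a pair, drop closing matches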

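The trick is basically the same - we find which rows have the string you're looking for, then create a way to alternate them (in this case, a running count with cumsum, then modulo 2). Rows from an opening match up to just before its closing match have an odd count, so they are filtered out; the closing matches themselves have an even count, so !stringy removes them. The result still carries the helper columns stringy and seq, which select(-stringy, -seq) would drop.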
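
For comparison, the same cumsum-and-modulo idea can be written in base R without any packages. This is a sketch rather than part of the original answer; the variable names `inside` and `kept` are made up for illustration, and it assumes, as in the question, that matches come in opening/closing pairs.

```r
# Rebuild the example data from the question
df <- data.frame(
  a = c("one", "here is a string", "two", "three", "four", "another string",
        "five", "six", "yet another string", "seven", "last string"),
  b = letters[1:11],
  c = rep(c("type1", "type2"), times = c(6, 5))
)

stringy <- grepl("string", df$a)       # TRUE on rows containing the string

# An odd running count means an opening match has been seen but not yet closed
inside <- cumsum(stringy) %% 2 == 1

# Drop rows inside a pair, plus the matching rows themselves
kept <- df[!(inside | stringy), ]
```

Because this version subsets with a logical vector instead of negative row numbers, a data frame with no matches comes back unchanged.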
