R - How can I subset rows of a dataframe using an index of integers?

I am trying to find a method for subsetting or slicing a dataframe based on each occurrence of a certain string in one column/variable, e.g. I would like to delete all rows between two occurrences of the string. This problem is similar to this question, BUT the crucial difference is that I have multiple occurrences of the string and would like to delete the rows between each pair of occurrences. I'm an R dunce and I can't find an elegant way to apply that solution to an index of more than two integers.

Say I have the following dataframe:

a <- c("one", "here is a string", "two", "three", "four", "another string", "five", "six", "yet another string", "seven", "last string")
b <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
c <- c("type1", "type1", "type1", "type1", "type1", "type1", "type2", "type2", "type2", "type2", "type2")

df <- data.frame(a,b,c)

Which gives the following:

print(df)

                 a b     c
1                 one a type1
2    here is a string b type1
3                 two c type1
4               three d type1
5                four e type1
6      another string f type1
7                five g type2
8                 six h type2
9  yet another string i type2
10              seven j type2
11        last string k type2

I would like to subset it so that all rows between, and including, each pair of occurrences of the string 'string' are removed:

     a b     c
1  one a type1
7 five g type2
8  six h type2

Using the solution accepted in the question I've linked to, I can remove the first set of rows by creating an index of row numbers and using the first two positions in the index:

index = grep("string", df$a)

df[-(index[1]:index[2]),]

But what I want to do would also include removing the rows between the next pair of integers in my index:

df[-(index[3]:index[4]),]

My actual index has 128 integers (64 'pairs'), so manually extracting the rows as I've done above will be a pain in the neck. If I can't find an elegant solution, my current plan is to print the index and manually extract the rows (which, tbh, would probably have been faster than writing this question, but would look awful and wouldn't teach me anything):

print(index)

[1]  2  6  9 11

df[-c(2:6, 9:11), ]

Is there a way to loop over each consecutive pair of integers in the index, or another way of doing what I'm trying to do? I'm not a hugely experienced R user, and I scoured SO for what I'm trying to do before creating this example (which I hope adheres to reprex standards; this is the first time I've asked a question).

I have included column 'c' in the reprex because it reflects the structure of my actual data (one pair of 'string' occurrences in column 'a' for each change of value in column 'c'), and I'm wondering if there's a way to use group_by() with a base subsetting expression. But this could be a total red herring; I'm just including it in case it helps.

Create a sequence between each consecutive pair of values in index using Map, and remove those rows. One way to get the consecutive pairs is to subset index with alternating logical values, which R recycles along its length.

df[-unlist(Map(`:`, index[c(TRUE, FALSE)], index[c(FALSE, TRUE)])),]

#     a b     c
#1  one a type1
#7 five g type2
#8  six h type2
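To make the one-liner above easier to follow, here is the same idea split into named steps. This is only a sketch: the names starts, ends, drop and result are introduced for illustration, and the stopifnot() guard is an added assumption that the matches always come in complete pairs.

```r
# Rebuild the example data from the question
a <- c("one", "here is a string", "two", "three", "four", "another string",
       "five", "six", "yet another string", "seven", "last string")
b <- letters[1:11]
c <- rep(c("type1", "type2"), times = c(6, 5))
df <- data.frame(a, b, c)

index <- grep("string", df$a)           # 2 6 9 11

# Assumption: every opening match has a closing match
stopifnot(length(index) %% 2 == 0)

# The logical vectors are recycled, picking every other element
starts <- index[c(TRUE, FALSE)]         # 2 9
ends   <- index[c(FALSE, TRUE)]         # 6 11

# Expand each (start, end) pair into its full run of row numbers
drop <- unlist(Map(`:`, starts, ends))  # 2 3 4 5 6 9 10 11

result <- df[-drop, ]                   # keeps rows 1, 7 and 8
```

One caveat with negative indexing: if nothing matches, drop is empty and df[-drop, ] returns zero rows instead of the whole dataframe. If that case can occur in your data, a logical mask such as df[!(seq_len(nrow(df)) %in% drop), ] may be the safer choice.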

Since I already posted it on twitter, here's a tidyverse-y solution:

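library(dplyr)

df %>%
  mutate(stringy = grepl("string", a),  # TRUE on rows containing "string"
         seq = cumsum(stringy)) %>%     # running count of matches so far
  filter(seq %% 2 == 0, !stringy)       # keep rows outside every pair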

The trick is basically the same: we find which rows have the string you're looking for, then create a way to alternate them (in this case, adding a running count with cumsum and then using modulo 2), then filter out the rows with an odd count plus the closing occurrence of each pair. The closing rows have an even count, which is why the extra !stringy condition is needed. Note that the result keeps the helper columns stringy and seq.
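For comparison, the same cumsum-and-modulo idea can be sketched in base R without any packages. The names stringy, inside, result, count_by_c and result_by_c are illustrative, and the grouped variant at the end assumes, as the question describes, that each value of column c contains its own complete pair of matches.

```r
# Rebuild the example data from the question
a <- c("one", "here is a string", "two", "three", "four", "another string",
       "five", "six", "yet another string", "seven", "last string")
b <- letters[1:11]
c <- rep(c("type1", "type2"), times = c(6, 5))
df <- data.frame(a, b, c)

stringy <- grepl("string", df$a)

# An odd running count means we are between an opening and a closing match
# cumsum(stringy): 0 1 1 1 1 2 2 2 3 3 4
inside <- cumsum(stringy) %% 2 == 1

# Drop rows inside a pair, plus the closing rows (even count, but matching)
result <- df[!(inside | stringy), ]

# Grouped variant: restart the count within each value of column c
count_by_c <- ave(as.integer(stringy), df$c, FUN = cumsum)
result_by_c <- df[!(count_by_c %% 2 == 1 | stringy), ]
```

On this example both versions keep rows 1, 7 and 8. Restarting the count per group would only change the outcome if some group had an unmatched occurrence, in which case the ungrouped count would spill over into the next group.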
