
R: subset a range of rows below a certain string

I have several dataframes in R of the following shape:

> pos.sentence
   doc_id token_id   token   pos
1      d1        1      Ik  PRON
2      d1        2    weet  VERB
3      d1        3     dat SCONJ
4      d1        4     jij  PRON
5      d1        5     dat SCONJ
6      d1        6     wil   AUX
7      d1        7      en CCONJ
8      d1        8      ik  PRON
9      d1        9     heb   AUX
10     d1       10     het   DET
11     d1       11      al   ADV
12     d1       12 gekocht  VERB

I would like to split the data into subsets, each running from a row where the pos column is PRON up to (but not including) the next PRON row. In this case, that gives three separate subsets/dataframes:

   doc_id token_id   token   pos
1      d1        1      Ik  PRON
2      d1        2    weet  VERB
3      d1        3     dat SCONJ

   doc_id token_id   token   pos
4      d1        4     jij  PRON
5      d1        5     dat SCONJ
6      d1        6     wil   AUX
7      d1        7      en CCONJ

   doc_id token_id   token   pos
8      d1        8      ik  PRON
9      d1        9     heb   AUX
10     d1       10     het   DET
11     d1       11      al   ADV
12     d1       12 gekocht  VERB

Does anyone know a way to do this? The dataframes that serve as my input vary in size, so I cannot subset on the basis of row numbers.

How about this? First, determine group membership:

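library(tidyverse)
# posdata is the data frame from the question (pos.sentence), here as a tibble
z <- posdata %>%
    mutate(ispron = 1 * (pos == "PRON")) %>%                # 1 for pronoun rows, 0 otherwise
    mutate(group = cumsum(c(1, sign(diff(ispron)) > 0)))    # new group at each 0 -> 1 step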
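
To see how the group index is built, here is a small sketch in base R that runs the same three steps on a plain character vector. The vector `pos` is just the pos column from the question typed out by hand, and the intermediate name `step` is introduced here for illustration only.

```r
# The pos column from the question, typed out as a plain vector
pos <- c("PRON", "VERB", "SCONJ", "PRON", "SCONJ", "AUX",
         "CCONJ", "PRON", "AUX", "DET", "ADV", "VERB")

ispron <- 1 * (pos == "PRON")      # 1 where the row is a pronoun, else 0
step   <- sign(diff(ispron)) > 0   # TRUE where a non-pronoun row is followed by a pronoun row
group  <- cumsum(c(1, step))       # start at 1, add 1 at every such step

group
# [1] 1 1 1 2 2 2 2 3 3 3 3 3
```

Note that `diff(ispron)` is 0 between two adjacent PRON rows, so with this construction two pronouns in a row would land in the same group rather than starting a new one each.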

Next, split into multiple objects:

> split(z,z$group) 
$`1`
# A tibble: 3 x 6
  doc_id token_id token pos   ispron group
  <fct>     <int> <fct> <fct>  <dbl> <dbl>
1 d1            1 Ik    PRON      1.    1.
2 d1            2 weet  VERB      0.    1.
3 d1            3 dat   SCONJ     0.    1.

$`2`
# A tibble: 4 x 6
  doc_id token_id token pos   ispron group
  <fct>     <int> <fct> <fct>  <dbl> <dbl>
1 d1            4 jij   PRON      1.    2.
2 d1            5 dat   SCONJ     0.    2.
3 d1            6 wil   AUX       0.    2.
4 d1            7 en    CCONJ     0.    2.

$`3`
# A tibble: 5 x 6
  doc_id token_id token   pos   ispron group
  <fct>     <int> <fct>   <fct>  <dbl> <dbl>
1 d1            8 ik      PRON      1.    3.
2 d1            9 heb     AUX       0.    3.
3 d1           10 het     DET       0.    3.
4 d1           11 al      ADV       0.    3.
5 d1           12 gekocht VERB      0.    3.
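
Since the question mentions several dataframes of varying size, the same idea can be wrapped in a function. The following is a minimal base R sketch, not part of the answer above: `split_at_tag` is a hypothetical helper name, and it uses `cumsum(pos == "PRON")` directly instead of the `diff()` construction, which means every PRON row starts a new group, even when two pronouns are adjacent.

```r
# Hypothetical helper: split a data frame at every row where `pos` equals `tag`
split_at_tag <- function(df, tag = "PRON") {
  group <- cumsum(df$pos == tag)   # counter goes up by 1 at each matching row
  split(df, group)                 # one data frame per counter value
}

# The example data from the question, rebuilt by hand
pos.sentence <- data.frame(
  doc_id   = "d1",
  token_id = 1:12,
  token    = c("Ik", "weet", "dat", "jij", "dat", "wil",
               "en", "ik", "heb", "het", "al", "gekocht"),
  pos      = c("PRON", "VERB", "SCONJ", "PRON", "SCONJ", "AUX",
               "CCONJ", "PRON", "AUX", "DET", "ADV", "VERB"),
  stringsAsFactors = FALSE
)

pieces <- split_at_tag(pos.sentence)
sapply(pieces, nrow)
# 1 2 3
# 3 4 5

# Applying it to several data frames at once (the second one is just a slice for illustration)
all_pieces <- lapply(list(pos.sentence, pos.sentence[4:12, ]), split_at_tag)
```

One thing to keep in mind with this variant: rows that come before the first PRON in a data frame get counter value 0 and end up in a piece named "0".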
