I have several dataframes in R of the following shape:
> pos.sentence
   doc_id token_id   token   pos
1      d1        1      Ik  PRON
2      d1        2    weet  VERB
3      d1        3     dat SCONJ
4      d1        4     jij  PRON
5      d1        5     dat SCONJ
6      d1        6     wil   AUX
7      d1        7      en CCONJ
8      d1        8      ik  PRON
9      d1        9     heb   AUX
10     d1       10     het   DET
11     d1       11      al   ADV
12     d1       12 gekocht  VERB
What I would like to do is split the data into subsets, each running from a row where the pos column is PRON up to (but not including) the next PRON row. In this case, that would give three separate subsets/dataframes:
  doc_id token_id token   pos
1     d1        1    Ik  PRON
2     d1        2  weet  VERB
3     d1        3   dat SCONJ

  doc_id token_id token   pos
4     d1        4   jij  PRON
5     d1        5   dat SCONJ
6     d1        6   wil   AUX
7     d1        7    en CCONJ

   doc_id token_id   token  pos
8      d1        8      ik PRON
9      d1        9     heb  AUX
10     d1       10     het  DET
11     d1       11      al  ADV
12     d1       12 gekocht VERB
Does anyone know a way to do this? The dataframes that serve as my input vary in size, so I cannot make subsets on the basis of row number.
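How about this? First, determine group membership:
library(tidyverse)
# here posdata holds the data shown in the question as pos.sentence
z <- posdata %>%
  mutate(ispron = 1 * (pos == "PRON")) %>%               # 1 for PRON rows, 0 otherwise
  mutate(group = cumsum(c(1, sign(diff(ispron)) > 0)))   # new group at each 0 -> 1 step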
Next, split into multiple objects:
> split(z,z$group)
$`1`
# A tibble: 3 x 6
  doc_id token_id token pos   ispron group
  <fct>     <int> <fct> <fct>  <dbl> <dbl>
1 d1            1 Ik    PRON      1.    1.
2 d1            2 weet  VERB      0.    1.
3 d1            3 dat   SCONJ     0.    1.

$`2`
# A tibble: 4 x 6
  doc_id token_id token pos   ispron group
  <fct>     <int> <fct> <fct>  <dbl> <dbl>
1 d1            4 jij   PRON      1.    2.
2 d1            5 dat   SCONJ     0.    2.
3 d1            6 wil   AUX       0.    2.
4 d1            7 en    CCONJ     0.    2.

$`3`
# A tibble: 5 x 6
  doc_id token_id token   pos   ispron group
  <fct>     <int> <fct>   <fct>  <dbl> <dbl>
1 d1            8 ik      PRON      1.    3.
2 d1            9 heb     AUX       0.    3.
3 d1           10 het     DET       0.    3.
4 d1           11 al      ADV       0.    3.
5 d1           12 gekocht VERB      0.    3.
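For comparison, here is a minimal base-R sketch of the same idea, not necessarily what the answer above intended. It uses cumsum() directly on the PRON test, so every PRON row opens a new group. The data frame is rebuilt by hand from the question's example, which is an assumption about how pos.sentence was created, and the names group and chunks are just illustrative.

```r
# Toy data mirroring the question's pos.sentence (hand-built reconstruction)
pos.sentence <- data.frame(
  doc_id   = "d1",
  token_id = 1:12,
  token    = c("Ik", "weet", "dat", "jij", "dat", "wil", "en",
               "ik", "heb", "het", "al", "gekocht"),
  pos      = c("PRON", "VERB", "SCONJ", "PRON", "SCONJ", "AUX", "CCONJ",
               "PRON", "AUX", "DET", "ADV", "VERB"),
  stringsAsFactors = FALSE
)

# Each PRON row increments the counter, so every PRON starts a new group,
# including a PRON that directly follows another PRON.
group <- cumsum(pos.sentence$pos == "PRON")

# One data frame per group; rows before the first PRON (if any) end up in group "0".
chunks <- split(pos.sentence, group)

# Number of rows in each subset
sapply(chunks, nrow)
```

One difference worth noting: with the diff()-based grouping, two PRON rows in a row produce no 0 -> 1 step, so they stay in the same group, whereas cumsum(pos == "PRON") separates them. Which behaviour is wanted depends on the data.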