NOTE: An update/new question on this begins at =====================
Original post: I am working with utterances, statements spoken by children. From each utterance, if one or more words in the statement match a predefined list of multiple 'core' words (probably 300 words), then I want to input '1' into 'Core' (and if none, then input '0' into 'Core').
If there are one or more words in the statement that are NOT core words, then I want to input '1' into 'Fringe' (and if there are only core words and nothing extra, then input '0' into 'Fringe').
Basically, right now I have only the utterances. From those, I need to identify whether any words match one of the core words and, if there are any extra words, flag those as fringe. Here is a snippet of my data.
Core Fringe Utterance
1 NA NA small
2 NA NA small
3 NA NA where's his bed
4 NA NA there's his bed
5 NA NA there's his bed
6 NA NA is that a pillow
Thanks to rjen's answer to an original post, the following code identifies core words and fringe words, assuming I know the fringe words (which is what I originally anticipated). However, the challenge has changed: now anything that is NOT core counts as fringe. So I need to keep the ability to detect words from the list and flag them as core, but I also need to search the utterance and, if it contains any words that are NOT core, set Fringe to '1', because I will not have a list of fringe words.
library(dplyr)
library(stringr)
coreWords <- c('small', 'bed')
fringeWords <- c('head', 'his')
CFdataNew <- CFdata %>%
mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
Fringe = + str_detect(Utterance, str_c(fringeWords, collapse = '|')))
The dput() code is:
structure(list(Utterance = c("small", "small", "where's his bed", "there's his bed", "there's his bed", "is that a pillow", "what is that on his head", "hey he has his arm stuck here", "there there's it", "now you're gonna go night_night", "and that's the thing you can turn on", "yeah where's the music+box"), Core = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Fringe = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -12L))
===================
UPDATE The response from rjen worked great with the example data. However, I now have a 'real' list of core words and utterances, and not all extra words are being picked up as fringe, and a few words are being identified as core even though they are not. I encased the words in double quotes to see if that may help resolve the issue so that the core words are explicit, but it did not.
coreWords <-c("I", "no", "nah", "nope", "yes", "yeah", "yea", "mmhmm", "yah",
"ya", "un-huh", "uhhuh", "my", "the", "want", "is", "it", "that",
"a", "go", "mine", "you", "what", "on", "in", "here", "more",
"out", "off", "some", "help", "all done", "finished")
Df1 <- df %>%
mutate(id = row_number()) %>%
separate_rows(Utterance, sep = ' ') %>%
mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
Fringe = + !Core) %>%
group_by(id) %>%
mutate(Core = + (sum(Core) > 0),
Fringe = + (sum(Fringe) > 0)) %>%
slice(1) %>%
select(-Utterance) %>%
left_join(df) %>%
ungroup() %>%
select(Utterance, Core, Fringe, id)
The output from the script above and longer list of core words looks something like this.
# A tibble: 98 x 4
Utterance Core Fringe id
<chr> <int> <int> <int>
1 a baby 1 0 1
2 small 1 0 2
3 yes 1 0 3
4 where's his bed 1 1 4
5 there's his bed 1 1 5
6 where's his pillow 1 1 6
7 what is that on his head 1 0 7
8 hey he has his arm stuck here 1 1 8
9 there there's it 1 0 9
10 now you're gonna go night-night 1 1 10
# ... with 88 more rows
For example, in line 1, 'a' is a core word so '1' for core is correct. However, 'baby' should be picked up as fringe so there should be '1', not '0', for fringe. Lines 7 and 9 also have words that should be identified as fringe but are not.
Additionally, it seems like if the utterance has parts of a core word in it, it's being counted. For example, 'small' is identified as a core word even though it's not (but 'all done' is a core word). 'Where's his bed' is identified as core and fringe, although none of the words are core. Any suggestions on what is happening and how to correct it are greatly appreciated.
A little trick to do this is to replace (gsub()) all core words in the utterances with an empty string "". Then check whether the length of the remaining string (nchar()) is still greater than zero. If it is, there are non-core words in the utterance. Applying trimws() to the strings after removing the core words ensures that leftover whitespace is not counted as characters.
This is the code by itself.
nchar(trimws(gsub(str_c(coreWords, collapse = '|'), "", CFdata$Utterance))) > 0
#> [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Here is a step-by-step version that lets you inspect what is happening.
CFdata %>%
mutate(
core_words_removed = trimws(gsub(str_c(coreWords, collapse = '|'), "", Utterance)),
no_core_words_included = as.numeric(nchar(core_words_removed) > 0)
)
#> Utterance core_words_removed
#> 1 small
#> 2 small
#> 3 where's his bed where's his
#> 4 there's his bed there's his
#> 5 there's his bed there's his
#> 6 is that a pillow is that a pillow
#> 7 what is that on his head what is that on his head
#> 8 hey he has his arm stuck here hey he has his arm stuck here
#> 9 there there's it there there's it
#> 10 now you're gonna go night_night now you're gonna go night_night
#> 11 and that's the thing you can turn on and that's the thing you can turn on
#> 12 yeah where's the music+box yeah where's the music+box
#> no_core_words_included
#> 1 0
#> 2 0
#> 3 1
#> 4 1
#> 5 1
#> 6 1
#> 7 1
#> 8 1
#> 9 1
#> 10 1
#> 11 1
#> 12 1
And here it is in one step and integrated into your original code snippet.
CFdataNew <-
CFdata %>%
mutate(
Core = as.numeric(str_detect(Utterance, str_c(coreWords, collapse = '|'))),
# trimws() drops the whitespace left behind once core words are removed
no_core_words_included = as.numeric(nchar(trimws(gsub(
str_c(coreWords, collapse = '|'), "", Utterance
))) > 0),
Fringe = as.numeric(str_detect(
Utterance, str_c(fringeWords, collapse = '|')
))
)
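The partial matches described in the update (for example, 'small' counting as core because it contains 'a') come from the pattern being unanchored, so any substring can match. Below is a minimal sketch of the same replace-and-count idea in base R, with matches restricted to whitespace-delimited words. It assumes the core words contain no regex metacharacters. The names core_words, utterances, and core_pattern are illustrative, and the word list is only a small subset of the one in the question.

```r
# Illustrative subset of the core list; "all done" shows a multi-word entry.
core_words <- c("a", "is", "that", "on", "it", "what", "here", "all done")

# Put longer entries first so multi-word phrases are tried before short words.
alternation <- paste(core_words[order(-nchar(core_words))], collapse = "|")

# (?<!\S) and (?!\S) require whitespace or a string edge on both sides,
# so "a" no longer matches inside "small" and "here" not inside "where's".
core_pattern <- paste0("(?<!\\S)(?:", alternation, ")(?!\\S)")

utterances <- c("a baby", "small", "what is that on his head",
                "there there's it", "all done")

# Core: at least one whole-word match.
core <- as.integer(grepl(core_pattern, utterances, perl = TRUE))

# Fringe: something other than whitespace remains after removing core words.
leftover <- trimws(gsub(core_pattern, "", utterances, perl = TRUE))
fringe <- as.integer(nchar(leftover) > 0)

data.frame(Utterance = utterances, Core = core, Fringe = fringe)
```

Matching here is case-sensitive, so a core word such as "I" would only match an uppercase "I" in the utterance. Passing ignore.case = TRUE to grepl() and gsub() is one way to relax that, if it suits the data.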
A tidyverse option using separate_rows()
library(dplyr)
library(stringr)
library(tidyr)
coreWords <- c('small', 'bed')
df1 <- df %>%
transmute(id = row_number(),
Utterance = Utterance)
df %>%
mutate(id = row_number()) %>%
separate_rows(Utterance, sep = ' ') %>%
mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
Fringe = + !Core) %>%
group_by(id) %>%
mutate(Core = + (sum(Core) > 0),
Fringe = + (sum(Fringe) > 0)) %>%
slice(1) %>%
select(-Utterance) %>%
left_join(df1) %>%
ungroup() %>%
select(Utterance, Core, Fringe, -id)
# # A tibble: 12 x 3
# Utterance Core Fringe
# <chr> <int> <int>
# 1 small 1 0
# 2 small 1 0
# 3 where's his bed 1 1
# 4 there's his bed 1 1
# 5 there's his bed 1 1
# 6 is that a pillow 0 1
# 7 what is that on his head 0 1
# 8 hey he has his arm stuck here 0 1
# 9 there there's it 0 1
# 10 now you're gonna go night_night 0 1
# 11 and that's the thing you can turn on 0 1
# 12 yeah where's the music+box 0 1
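With a longer core list, str_detect() on each separated word still matches substrings, which is the behaviour reported in the update. One possible alternative, sketched here in base R under the assumption that every core entry is a single word, is to split each utterance into tokens and test exact membership with %in%. The names core_words, utterances, and tokens are illustrative, and multi-word entries such as "all done" are not handled by this sketch.

```r
# Illustrative subset of single-word core entries.
core_words <- c("a", "is", "that", "on", "it", "what", "here")

utterances <- c("a baby", "small", "what is that on his head",
                "there there's it", "is that it")

# Split each utterance on runs of whitespace: one character vector per utterance.
tokens <- strsplit(utterances, "\\s+")

# Core is 1 if any token is exactly a core word;
# Fringe is 1 if any token is not a core word.
core <- vapply(tokens, function(w) as.integer(any(w %in% core_words)), integer(1))
fringe <- vapply(tokens, function(w) as.integer(any(!w %in% core_words)), integer(1))

data.frame(Utterance = utterances, Core = core, Fringe = fringe)
```

In the pipeline above, the equivalent change would be to use Utterance %in% coreWords in place of the str_detect() call after separate_rows(), since each row then holds a single word.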