Identifying words from a list and coding them as 1 (otherwise 0), and coding words NOT on the list as 1

NOTE: An update/new question on this begins at =====================

Original post: I am working with utterances, statements spoken by children. From each utterance, if one or more words in the statement match a predefined list of multiple 'core' words (probably 300 words), then I want to input '1' into 'Core' (and if none, then input '0' into 'Core').

If there are one or more words in the statement that are NOT core words, then I want to input '1' into 'Fringe' (and if there are only core words and nothing extra, then input '0' into 'Fringe').

Basically, right now I have only the utterances, and from those I need to identify whether any words match one of the core words and, if there are any extra words, identify those as fringe. Here is a snippet of my data.

  Core Fringe        Utterance
1   NA     NA            small
2   NA     NA            small
3   NA     NA  where's his bed
4   NA     NA  there's his bed
5   NA     NA  there's his bed
6   NA     NA is that a pillow

Thanks to rjen from an original post, the following code will allow me to identify core words and fringe words, assuming I know the fringe words (which is what I originally anticipated). However, the challenge has changed so that now anything that is NOT core will count as fringe. So I need to keep the ability to detect words from the list and code them as core, but I also need to search each utterance and, if it contains any words that are NOT core, code 'Fringe' as '1', because I will not have a list of fringe words.

library(dplyr)
library(stringr)

coreWords <- c('small', 'bed')
fringeWords <- c('head', 'his')

CFdataNew <- CFdata %>%
  mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
         Fringe = + str_detect(Utterance, str_c(fringeWords, collapse = '|')))

The dput() code is:

    structure(list(Utterance = c("small", "small", "where's his bed", "there's his bed", "there's his bed", "is that a pillow", "what is that on his head", "hey he has his arm stuck here", "there there's it", "now you're gonna go night_night", "and that's the thing you can turn on", "yeah where's the music+box"), Core = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Fringe = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -12L))

===================

UPDATE: The response from rjen worked great with the example data. However, I now have a 'real' list of core words and utterances, and not all extra words are being picked up as fringe, and a few words are being identified as core even though they are not. I enclosed the words in double quotes to see whether making the core words explicit would resolve the issue, but it did not.

coreWords <-c("I", "no", "nah", "nope", "yes", "yeah", "yea", "mmhmm", "yah",
              "ya", "un-huh", "uhhuh", "my", "the", "want", "is", "it", "that",
              "a", "go", "mine", "you", "what", "on", "in", "here", "more",
              "out", "off", "some", "help", "all done", "finished")

Df1 <- df %>%
  mutate(id = row_number()) %>%
  separate_rows(Utterance, sep = ' ') %>%
  mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
         Fringe = + !Core) %>%
  group_by(id) %>%
  mutate(Core = + (sum(Core) > 0),
         Fringe = + (sum(Fringe) > 0)) %>%
  slice(1) %>%
  select(-Utterance) %>%
  left_join(df) %>% 
  ungroup() %>%
  select(Utterance, Core, Fringe, id)

The output from the script above and longer list of core words looks something like this.

# A tibble: 98 x 4
   Utterance                        Core Fringe    id
   <chr>                           <int>  <int> <int>
 1 a baby                              1      0     1
 2 small                               1      0     2
 3 yes                                 1      0     3
 4 where's his bed                     1      1     4
 5 there's his bed                     1      1     5
 6 where's his pillow                  1      1     6
 7 what is that on his head            1      0     7
 8 hey he has his arm stuck here       1      1     8
 9 there there's it                    1      0     9
10 now you're gonna go night-night     1      1    10
# ... with 88 more rows

For example, in line 1, 'a' is a core word so '1' for core is correct. However, 'baby' should be picked up as fringe so there should be '1', not '0', for fringe. Lines 7 and 9 also have words that should be identified as fringe but are not.

Additionally, it seems like if the utterance has parts of a core word in it, it's being counted. For example, 'small' is identified as a core word even though it's not (but 'all done' is a core word). 'Where's his bed' is identified as core and fringe, although none of the words are core. Any suggestions on what is happening and how to correct it are greatly appreciated.

A little trick to do this is to replace (gsub()) all core words in the utterances with an empty string "". Then check whether the length of the remaining string (nchar()) is still bigger than zero. If it is bigger than zero, there are non-core words in the utterance. Applying trimws() to the strings after replacing the core words makes sure that no leftover whitespace is counted as characters.

This is the code by itself.

nchar(trimws(gsub(str_c(coreWords, collapse = '|'), "", CFdata$Utterance))) > 0
#>  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Here is a step-by-step version that lets you inspect what is happening.

CFdata %>%
  mutate(
    core_words_removed = trimws(gsub(str_c(coreWords, collapse = '|'), "", Utterance)),
    no_core_words_included = as.numeric(nchar(core_words_removed) > 0)
  )
#>                               Utterance                   core_words_removed
#> 1                                 small                                     
#> 2                                 small                                     
#> 3                       where's his bed                          where's his
#> 4                       there's his bed                          there's his
#> 5                       there's his bed                          there's his
#> 6                      is that a pillow                     is that a pillow
#> 7              what is that on his head             what is that on his head
#> 8         hey he has his arm stuck here        hey he has his arm stuck here
#> 9                      there there's it                     there there's it
#> 10      now you're gonna go night_night      now you're gonna go night_night
#> 11 and that's the thing you can turn on and that's the thing you can turn on
#> 12           yeah where's the music+box           yeah where's the music+box
#>    no_core_words_included
#> 1                       0
#> 2                       0
#> 3                       1
#> 4                       1
#> 5                       1
#> 6                       1
#> 7                       1
#> 8                       1
#> 9                       1
#> 10                      1
#> 11                      1
#> 12                      1

And here it is in one step and integrated into your original code snippet.

CFdataNew <-
  CFdata %>%
  mutate(
    Core = as.numeric(str_detect(Utterance, str_c(coreWords, collapse = '|'))),
    # trimws() so that spaces left between removed core words are not counted
    no_core_words_included = as.numeric(nchar(trimws(gsub(
      str_c(coreWords, collapse = '|'), "", Utterance
    ))) > 0),
    Fringe = as.numeric(str_detect(
      Utterance, str_c(fringeWords, collapse = '|')
    ))
  )
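With the longer core list from the update, an unanchored pattern such as `a|is|it|...` also matches inside longer words ('a' in 'small', 'is' in 'his'), which would explain the misclassifications described there. A minimal sketch of one possible fix, in base R only: wrap the alternation in lookarounds so that a core word only matches when it stands between whitespace or the string ends. The helper `core_pattern()` and the shortened word list are hypothetical and only for illustration. The sketch assumes that the core words contain no regex metacharacters and that matching is case-sensitive.

```r
# Hypothetical helper (not from the original posts): builds a pattern that
# matches a core word only when it is delimited by whitespace or string ends.
core_pattern <- function(words) {
  paste0("(?<!\\S)(?:", paste(words, collapse = "|"), ")(?!\\S)")
}

# Shortened, illustrative subset of the core list from the update
coreWords  <- c("a", "is", "it", "that", "what", "on", "all done", "yeah")
utterances <- c("a baby", "small", "where's his bed",
                "what is that on his head", "all done", "yeah")

pat <- core_pattern(coreWords)

# Core: at least one whole core word (or phrase) occurs in the utterance
Core <- as.integer(grepl(pat, utterances, perl = TRUE))

# Fringe: something other than whitespace remains after removing core words
leftover <- trimws(gsub(pat, "", utterances, perl = TRUE))
Fringe   <- as.integer(nchar(leftover) > 0)

data.frame(Utterance = utterances, Core, Fringe)
```

Whitespace lookarounds are used here rather than `\\b` because `\\b` also sees a boundary at an apostrophe, so a core word like 'it' would match inside "it's". Whether that is desirable depends on how contractions should be coded.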

A tidyverse option using separate_rows()

library(dplyr)
library(stringr)
library(tidyr)

coreWords <- c('small', 'bed')

df1 <- df %>%
  transmute(id = row_number(),
            Utterance = Utterance)

df %>%
  mutate(id = row_number()) %>%
  separate_rows(Utterance, sep = ' ') %>%
  mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
         Fringe = + !Core) %>%
  group_by(id) %>%
  mutate(Core = + (sum(Core) > 0),
         Fringe = + (sum(Fringe) > 0)) %>%
  slice(1) %>%
  select(-Utterance) %>%
  left_join(df1) %>%
  ungroup() %>%
  select(Utterance, Core, Fringe, -id)

# # A tibble: 12 x 3
#    Utterance                             Core Fringe
#    <chr>                                <int>  <int>
#  1 small                                    1      0
#  2 small                                    1      0
#  3 where's his bed                          1      1
#  4 there's his bed                          1      1
#  5 there's his bed                          1      1
#  6 is that a pillow                         0      1
#  7 what is that on his head                 0      1
#  8 hey he has his arm stuck here            0      1
#  9 there there's it                         0      1
# 10 now you're gonna go night_night          0      1
# 11 and that's the thing you can turn on     0      1
# 12 yeah where's the music+box               0      1
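For the longer list in the update, the partial matches ('a' inside 'small', 'is' inside 'his') likely come from `str_detect()` matching substrings of each token. A minimal base R sketch of exact, token-level matching with `%in%` is shown below. The shortened word list is illustrative only, and the sketch assumes utterances have no leading whitespace. Multi-word entries such as 'all done' would not match this way, because each token is compared on its own.

```r
# Illustrative subset of the core list; utterances taken from the question
coreWords  <- c("a", "is", "it", "that", "what", "on", "yes")
utterances <- c("a baby", "small", "yes", "there there's it")

# Split each utterance on whitespace into individual tokens
tokens <- strsplit(utterances, "\\s+")

# Core: any token is exactly a core word
Core <- vapply(tokens, function(w) as.integer(any(w %in% coreWords)),
               integer(1))

# Fringe: any token is not a core word
Fringe <- vapply(tokens, function(w) as.integer(any(!(w %in% coreWords))),
                 integer(1))

data.frame(Utterance = utterances, Core, Fringe)
```

In the `separate_rows()` pipeline above, the analogous change would presumably be `Utterance %in% coreWords` in place of the `str_detect()` call, since each row holds a single token at that point.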
