
Search a vector of strings for all instances of each string in another vector of strings

I have some data in a tibble with a number column and an associated sentence column. I also have a vector of about 450 shorter strings that I want to check each sentence for. Ultimately I want the sum of the number entries associated with each of the ~450 strings, after prorating each sentence's number by its number of hits. For example, if a sentence is associated with a number value of 3 and has two hits from the 450 strings, I'd want to add 1.5 to each of their tallies (see example strings 1 and 2, both of which appear in the first sentence).

The example below gets to the "final_product" that I want for the 4 example strings, but it is not practical for the ~450 strings. (I'm not particularly wedded to building a large table of 2 + ~450 columns to do this, so if this can be done with a single list returned for the search matches, or any other way, that's fine.)

Can someone suggest a more scalable and appropriate way to arrive at the same basic output?

Thanks very much.

##Tibble with some strings and associated numbers
pacman::p_load(stringi, tidyverse)
set.seed(1)
entries <- tibble("numbers" = rnorm(100),
                    "strings" = stri_rand_strings(100, 15, "[A-Za-z]"))

#Strings known to show up for example
strings_to_find <- c("NJad", "GNl", "Qaw", "bQ")

#Answers in the form of a table
answers_as_table <- entries %>% 
  mutate(String1 = str_detect(strings, pattern = strings_to_find[[1]]),
         String2 = str_detect(strings, pattern = strings_to_find[[2]]),
         String3 = str_detect(strings, pattern = strings_to_find[[3]]),
         String4 = str_detect(strings, pattern = strings_to_find[[4]]))

#Find the number of strings in each entry
answers_as_table$CountofHits <- rowSums(answers_as_table[,3:6])
#prorate accordingly
answers_as_table$proration <- answers_as_table$numbers / answers_as_table$CountofHits

#Find the sum of the prorated amount
SumString1 <- sum(answers_as_table[answers_as_table$String1,8])
SumString2 <- sum(answers_as_table[answers_as_table$String2,8])
SumString3 <- sum(answers_as_table[answers_as_table$String3,8])
SumString4 <- sum(answers_as_table[answers_as_table$String4,8])


(final_product <- tibble("strings_to_find" = strings_to_find, 
       "Sums" = c(SumString1, SumString2, SumString3, SumString4)))

A base R attempt for giggles:

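# For each search string, find the row indices of the entries that contain it,
# then stack into long form: values = row index, ind = search string
g <- stack(sapply(strings_to_find, grep, x=entries$strings, simplify=FALSE))
g$numbers <- entries$numbers[g$values]
# Split each entry's number evenly across the strings it matched
g$prorata <- ave(g$numbers, g$values, FUN=function(x) x/length(x))
# Total the prorated amounts per search string
out <- aggregate(prorata ~ ind, data=g, sum)
out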

#   ind    prorata
#1 NJad -0.3132269
#2  GNl -0.3132269
#3  Qaw  0.1836433
#4   bQ  0.3575099

Compares well:

out == final_product
#      ind prorata
#[1,] TRUE    TRUE
#[2,] TRUE    TRUE
#[3,] TRUE    TRUE
#[4,] TRUE    TRUE
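
One caveat with grep() is that each search string is interpreted as a regular expression, so a string containing characters such as "." or "(" would not be matched literally. The following is a minimal sketch of a hit-matrix variant that uses fixed = TRUE instead. It is not the approach above, and the toy_* objects are hypothetical data chosen to be small enough to check by hand.

```r
# Toy data (hypothetical), small enough to check by hand
toy_numbers  <- c(3, 2, 5)
toy_strings  <- c("xxNJadxxGNlxx", "xxQawxx", "nothing")
toy_patterns <- c("NJad", "GNl", "Qaw", "bQ")

# Logical hit matrix: one row per entry, one column per search string.
# fixed = TRUE matches each pattern literally rather than as a regex.
hits <- sapply(toy_patterns, grepl, x = toy_strings, fixed = TRUE)

# Number of search strings found in each entry
n_hits <- rowSums(hits)

# Each entry's number split evenly across its hits (0 where nothing matched)
share <- ifelse(n_hits > 0, toy_numbers / n_hits, 0)

# Per search string, add up the shares of the entries that contain it
toy_sums <- colSums(hits * share)
toy_sums
```

Here the first entry (3) is split between NJad and GNl at 1.5 each, Qaw receives the full 2, and bQ stays at 0. Unlike the stacked version, search strings with no hits remain in the result with a total of 0.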

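We can loop over the vector to create the columns, then prorate and sum:

library(purrr)
library(stringr)
library(dplyr)

# One logical column per search string, named String1, String2, ...
hits <- map2_dfc(strings_to_find,
        str_c("String", seq_along(strings_to_find)),
         ~ entries %>% 
              transmute(!! .y := str_detect(strings, .x)))

# Attach the hit columns, count the hits per entry and prorate its number
answers_as_table <- bind_cols(entries, hits) %>%
        mutate(CountofHits = rowSums(hits),
               proration = numbers / CountofHits)

# Sum the prorated amounts over the rows where each string was found
sumstring <- answers_as_table %>%
        summarise(across(all_of(names(hits)), ~ sum(proration[.x])))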
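
The wide table above gains one column per search string. Since the question mentions that a single list of matches per entry would also be acceptable, here is a hedged sketch of that idea using a list-column. The names toy_entries, toy_patterns and found are hypothetical, and fixed() is an assumption that the search strings should be matched literally.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

# Toy data (hypothetical), small enough to check by hand
toy_entries  <- tibble(numbers = c(3, 2, 5),
                       strings = c("xxNJadxxGNlxx", "xxQawxx", "nothing"))
toy_patterns <- c("NJad", "GNl", "Qaw", "bQ")

toy_result <- toy_entries %>%
  # List-column: the search strings found in each entry
  mutate(found = map(strings, ~ toy_patterns[str_detect(.x, fixed(toy_patterns))]),
         # Split each entry's number evenly across its hits
         prorata = numbers / lengths(found)) %>%
  # One row per (entry, matched string); entries with no hits are dropped
  unnest(found) %>%
  group_by(found) %>%
  summarise(Sums = sum(prorata))

toy_result
```

With the toy data, NJad and GNl each get 1.5 and Qaw gets 2. Search strings that never match (bQ here) do not appear in the result at all, so join back to the full vector if zero rows are needed.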
            
