
Partial Case-Insensitive String Search in R

I have a spotify$gens column that contains the genre description for each album.

For example: head(spotify$gens) gives

gens = c("Jazz Fusion", "Latin Rock, Progressive Rock", "Progressive Rock", 
"Blues Rock, Electric Blues", "Electric Texas Blues, Electric Blues", "Piano Blues, Chicago Blues")

I want to use what I have made:

keyGenres = c("Pop","Rock","Hip Hop","Latin",
  "Dance","Electronic","R&B","Country","Folk",
  "Acoustic","Classical","Metal","Jazz","New Age",
  "Blues","World","Traditional")

to match against spotify$gens and return the matching part of the string.

I have this code right now:

for (i in seq_along(spotify$gens)){
  for (genre in keyGenres){
    if( spotify$gens[i] %ilike% keyGenres[genre]){
       spotify$gens[i] <- keyGenres[genre]
    } else{
      spotify$gens[i] = spotify$gens[i]
    }}}

but it returns this error: Error in if (spotify$gens[i] %ilike% keyGenres[genre]) {: missing value where TRUE/FALSE needed

An example of the result I want: spotify$gens[1] = "Jazz Fusion" should become spotify$gens[1] = "Jazz"

Some albums have more than one genre, and I want to return only the first one that matches.

Can anyone help me out? Thank you!!

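The problem with your loop is that for (genre in keyGenres) makes genre a string such as "Pop", but you then use it as an index. Because keyGenres has no names, keyGenres[genre] is NA, so the if() condition evaluates to NA and fails. To index by position, you need seq_along, as in for (genre in seq_along(keyGenres)):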

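library(data.table)  # provides %ilike% (case-insensitive pattern match)

for (i in seq_along(gens)){
  for (genre in seq_along(keyGenres)){
    # genre is now an integer position, so keyGenres[genre] is a real value
    if( gens[i] %ilike% keyGenres[genre]){
       gens[i] <- keyGenres[genre]
    }
    # no else branch needed: unmatched entries are simply left unchanged
  }}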
gens
# [1] "Jazz"  "Rock"  "Rock"  "Rock"  "Blues" "Blues"
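For reference, here is a small base R sketch of the same idea that iterates over the genre values directly, so no indexing is involved. The helper name first_key_genre is hypothetical, and it assumes a plain case-insensitive substring match is what you want. "First" here means first in keyGenres order, not first in the string.

```r
gens <- c("Jazz Fusion", "Latin Rock, Progressive Rock", "Progressive Rock",
          "Blues Rock, Electric Blues", "Electric Texas Blues, Electric Blues",
          "Piano Blues, Chicago Blues")
keyGenres <- c("Pop", "Rock", "Hip Hop", "Latin", "Dance", "Electronic", "R&B",
               "Country", "Folk", "Acoustic", "Classical", "Metal", "Jazz",
               "New Age", "Blues", "World", "Traditional")

# Why the original loop failed: an unnamed vector indexed by a string gives NA
is.na(keyGenres["Rock"])
# [1] TRUE

# Hypothetical helper: return the first key (in `keys` order) found in `x`
first_key_genre <- function(x, keys) {
  for (key in keys) {
    # fixed = TRUE treats the key as a literal substring, not a regex
    if (grepl(tolower(key), tolower(x), fixed = TRUE)) return(key)
  }
  x  # no key matched: leave the value unchanged
}

result <- vapply(gens, first_key_genre, character(1),
                 keys = keyGenres, USE.NAMES = FALSE)
result
# [1] "Jazz"  "Rock"  "Rock"  "Rock"  "Blues" "Blues"
```

Returning as soon as a key matches also avoids re-testing a value that has already been replaced.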

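We can also eliminate the loops with str_replace_all, which is vectorized and accepts a named vector of regex patterns and replacements. The patterns are applied in keyGenres order, so the first genre in that list that matches wins. This should be faster than the nested loops on a large column. Note that any string matching none of the genres comes back lowercased, because the input is passed through tolower: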

library(stringr)
pat_replace = setNames(keyGenres, paste0(".*", tolower(keyGenres), ".*"))
result = str_replace_all(tolower(gens), pattern = pat_replace)
result
# [1] "Jazz"  "Rock"  "Rock"  "Rock"  "Blues" "Blues"
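Both approaches above pick the first match in keyGenres order. If "first" should instead mean the genre that appears earliest in each string, one possible base R sketch is below. It assumes the genre names contain no regex metacharacters (true for this list), and the variable names pattern, m and first_in_string are only illustrative.

```r
gens <- c("Jazz Fusion", "Latin Rock, Progressive Rock", "Progressive Rock",
          "Blues Rock, Electric Blues", "Electric Texas Blues, Electric Blues",
          "Piano Blues, Chicago Blues")
keyGenres <- c("Pop", "Rock", "Hip Hop", "Latin", "Dance", "Electronic", "R&B",
               "Country", "Folk", "Acoustic", "Classical", "Metal", "Jazz",
               "New Age", "Blues", "World", "Traditional")

# A single alternation pattern: "Pop|Rock|Hip Hop|..."
pattern <- paste(keyGenres, collapse = "|")

# regexpr finds the leftmost match in each string (-1 when there is none)
m <- regexpr(pattern, gens, ignore.case = TRUE)

# Start with NA everywhere, then fill in the strings that did match
first_in_string <- rep(NA_character_, length(gens))
first_in_string[m != -1] <- regmatches(gens, m)

# Map the matched text back to the spelling used in keyGenres
first_in_string <- keyGenres[match(tolower(first_in_string), tolower(keyGenres))]
first_in_string
# [1] "Jazz"  "Latin" "Rock"  "Blues" "Blues" "Blues"
```

Compared with the results above, the second and fourth entries differ ("Latin" and "Blues" instead of "Rock"), so which version to use depends on what "first match" should mean for your data. Strings with no match become NA here rather than being left unchanged.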

Using this data:

gens = c("Jazz Fusion", "Latin Rock, Progressive Rock", "Progressive Rock", 
"Blues Rock, Electric Blues", "Electric Texas Blues, Electric Blues", "Piano Blues, Chicago Blues")
keyGenres = c("Pop","Rock","Hip Hop","Latin","Dance","Electronic","R&B","Country","Folk","Acoustic","Classical","Metal","Jazz","New Age","Blues","World","Traditional")
