
Split parts of string defined by multiple delimiters into multiple variables in R

I have a large list of file names that I need to extract information from using R. The information is delimited by multiple dashes and underscores. I am having trouble finding a method that copes with the number of characters between delimiters being inconsistent. The order of the information will remain constant, as (hopefully) will the delimiters used.

For example:

 library(stringr)

 f <- data.frame(c("EI-SM4-AMW11_20160614_082800.wav", "PA-RF-A50_20160614_082800.wav"), stringsAsFactors = FALSE)
 colnames(f) <- "filename"
 # Fixed character positions: only correct when every field has the same width
 f$area <- str_sub(f$filename, 1, 2)
 f$rec <- str_sub(f$filename, 4, 6)
 f$site <- str_sub(f$filename, 8, 12)

This produces correct results for the first file, but incorrect results for the second.

I've tried the "stringr" and "stringi" packages and know that hard-coding the positions doesn't work, so I've come up with awkward solutions that combine both packages, such as:

library(stringi)

# site = everything between the last "-" and the first "_"
f$site <- str_sub(f$filename, 
                  stri_locate_last(f$filename, fixed="-")[,1]+1, 
                  stri_locate_first(f$filename, fixed="_")[,1]-1)

I feel like there must be a more elegant (and robust) method, perhaps involving regex (which I am painfully new to).

I've looked at other examples (Extract part of string (till the first semicolon) in R; R: Find the last dot in a string; Split string using regular expressions and store it into data frame).

Any suggestions/pointers would be very much appreciated.

Try separate() from the tidyr package. Note that splitting on '-' alone leaves the date, time, and extension attached to site:

library(tidyr)

f %>% separate(filename, c('area', 'rec', 'site'), sep = '-')

You can also split on several different delimiters at once, since sep is a regular expression:

f %>% separate(filename, c('area', 'rec', 'site', 'date', 'don_know_what_this_is', 'file_extension'), sep = '-|_|\\.')

and then keep only the columns you want using dplyr's select function:

 library(dplyr)
 library(tidyr)

 f %>% 
   separate(filename,
            c('area', 'rec', 'site', 'date',
              'don_know_what_this_is', 'file_extension'), 
            sep = '-|_|\\.') %>%
   select(area, rec, site)
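
If you would rather avoid extra packages, the same idea can be sketched in base R. This is only a rough sketch: it assumes every file name splits into exactly six pieces, and the column names "time" and "ext" are my own guesses at what the last fields mean.

```r
# Base R sketch: split on any of '-', '_' or '.' and bind the pieces row-wise.
f <- data.frame(filename = c("EI-SM4-AMW11_20160614_082800.wav",
                             "PA-RF-A50_20160614_082800.wav"),
                stringsAsFactors = FALSE)

# strsplit() returns one character vector per file name;
# rbind-ing them assumes each has the same number of pieces (six here).
parts <- do.call(rbind, strsplit(f$filename, "[-_.]"))
colnames(parts) <- c("area", "rec", "site", "date", "time", "ext")  # names are assumptions

# Keep only the fields of interest.
f$area <- parts[, "area"]
f$rec  <- parts[, "rec"]
f$site <- parts[, "site"]
```

If some file names could contain a different number of delimiters, check lengths(strsplit(...)) before binding, because rbind will silently recycle shorter vectors.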

Something like this:

library(stringr)
library(dplyr)

f$area <- word(f$filename, 1, sep = "-")
f$rec <- word(f$filename, 2, sep = "-")
f$site <- word(f$filename, 3, sep = "-") %>%
        word(1,sep = "_")        

dplyr is not necessary; it is loaded here only for the pipe (%>%), which makes chaining the two word() calls cleaner. The function word belongs to stringr.
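
Since the question asks about regex, here is a minimal base R sketch using capture groups with sub(). The pattern is my own and assumes the layout area-rec-site_rest, where area and rec contain no dash and site contains no underscore.

```r
f <- data.frame(filename = c("EI-SM4-AMW11_20160614_082800.wav",
                             "PA-RF-A50_20160614_082800.wav"),
                stringsAsFactors = FALSE)

# Three capture groups: up to the first '-', up to the second '-',
# then up to the first '_'. Everything after that is matched and discarded.
pattern <- "^([^-]+)-([^-]+)-([^_]+)_.*$"

# sub() replaces the whole match with the chosen capture group.
f$area <- sub(pattern, "\\1", f$filename)
f$rec  <- sub(pattern, "\\2", f$filename)
f$site <- sub(pattern, "\\3", f$filename)
```

One caveat: sub() returns the input unchanged when the pattern does not match, so it is worth checking grepl(pattern, f$filename) first if the file names may be irregular.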

