Splitting strings with regex in R

I have a list of URLs and I want to extract the year and the week number from each file name:

list <- c("http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf",
          "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr16fl2015.pdf",
          "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr17fj2015.pdf",
          "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr18ff2015.pdf",
          "http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf",
          "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr20kf2015.pdf",
          "http://dps.endavadigital.net/owgr/doc/content/archive/2016/owgr21ff2016.pdf")

The desired output is as follows:

year  week
2015  15
2015  16
2015  17
2015  18
2014  19
2015  20
2016  21

Here is my attempt, which does not give the desired result:

split_links<-setNames(type.convert(data.frame(
  str_match(list, 'owgr(\\d+)[a-z][a-z][a-z]+(\\d+)')[, -1])), c('year', 'week'))

Can anyone help me with this please?

Using gsub() and strsplit() from base R, where v is the URL vector defined under "Data" below:

setNames(do.call(rbind.data.frame, 
        strsplit(gsub(".*owgr(\\d+).*(\\d{4}).*", "\\2,\\1", v), ",")), 
        c("year", "week"))
#   year week
# 1 2015   15
# 2 2015   16
# 3 2015   17
# 4 2015   18
# 5 2014   19
# 6 2015   20
# 7 2016   21

Data:

v <- c("http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf", 
"http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr16fl2015.pdf", 
"http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr17fj2015.pdf", 
"http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr18ff2015.pdf", 
"http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf", 
"http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr20kf2015.pdf", 
"http://dps.endavadigital.net/owgr/doc/content/archive/2016/owgr21ff2016.pdf"
)
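
As a possible alternative that also returns integer columns, here is a minimal sketch using utils::strcapture(), which ships with base R (3.4.0 or later). The names urls, pattern, parsed and result are my own, and the pattern assumes every file name looks like owgr<week><letters><4-digit year>.pdf. Note that [a-z]+ accepts two or three letters, whereas the [a-z][a-z][a-z]+ in the question requires at least three, so names such as owgr16fl2015.pdf would not match there.

# Assumption: a shortened copy of the URL vector from the question
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr16fl2015.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf"
)

# Week digits, then one or more lowercase letters, then a four-digit year
pattern <- "^owgr(\\d+)[a-z]+(\\d{4})\\.pdf$"

# proto names the capture groups in order and sets the type of each column
parsed <- utils::strcapture(
  pattern,
  basename(urls),
  proto = data.frame(week = integer(), year = integer())
)

# Reorder the columns to match the layout requested in the question
result <- parsed[c("year", "week")]
print(result)

File names that do not match the pattern come back as NA rows, which makes unexpected names easy to spot.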

With the PDF naming scheme in your example data, you can use basename() to drop the directory part of each URL and stringr::str_extract_all() to extract every run of digits from the file name:

stringr::str_extract_all(basename(list), "\\d+", simplify = TRUE)
     [,1] [,2]  
[1,] "15" "2015"
[2,] "16" "2015"
[3,] "17" "2015"
[4,] "18" "2015"
[5,] "19" "2014"
[6,] "20" "2015"
[7,] "21" "2016"

Or, in a data frame:

df1 <- data.frame(links = list)

df1[c("week", "year")] <-  stringr::str_extract_all(basename(df1$links), "\\d+", simplify = TRUE)
df1
                                                                         links week year
1 http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf   15 2015
2  http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr16fl2015.pdf   16 2015
3  http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr17fj2015.pdf   17 2015
4  http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr18ff2015.pdf   18 2015
5  http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf   19 2014
6  http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr20kf2015.pdf   20 2015
7  http://dps.endavadigital.net/owgr/doc/content/archive/2016/owgr21ff2016.pdf   21 2016
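
Keep in mind that the week and year columns produced this way are character strings. If numbers are needed, a rough base R sketch of the same idea (digit runs taken from basename()) could look like the following. The names urls, files, digits, m and out are illustrative, and the code assumes each file name contains exactly two digit runs, week first and year second.

# Assumption: a shortened copy of the URL vector from the question
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2016/owgr21ff2016.pdf"
)

files <- basename(urls)

# gregexpr() locates every run of digits; regmatches() extracts them
digits <- regmatches(files, gregexpr("[0-9]+", files))

# Guard the assumption: exactly two digit runs per file name
stopifnot(all(lengths(digits) == 2))

# Stack the pairs into a character matrix, one row per file
m <- do.call(rbind, digits)

# Convert to integers and put year first, as in the desired output
out <- data.frame(year = as.integer(m[, 2]), week = as.integer(m[, 1]))
print(out)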
