I have a list of URLs and I want to extract the year and the week number from each one:
list=c("http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf","http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr16fl2015.pdf", "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr17fj2015.pdf","http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr18ff2015.pdf", "http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf","http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr20kf2015.pdf", "http://dps.endavadigital.net/owgr/doc/content/archive/2016/owgr21ff2016.pdf")
The desired output is as follows:
year week
2015 15
2015 16
2015 17
2015 18
2014 19
2015 20
2016 21
Here is my attempt, which does not give the desired result:
split_links<-setNames(type.convert(data.frame(
str_match(list, 'owgr(\\d+)[a-z][a-z][a-z]+(\\d+)')[, -1])), c('year', 'week'))
Can anyone help me with this please?
Using gsub() and strsplit() from base R:
setNames(do.call(rbind.data.frame,
strsplit(gsub(".*owgr(\\d+).*(\\d{4}).*", "\\2,\\1", v), ",")),
c("year", "week"))
# year week
# 1 2015 15
# 2 2015 16
# 3 2015 17
# 4 2015 18
# 5 2014 19
# 6 2015 20
# 7 2016 21
Data:
v <- c("http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr16fl2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr17fj2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr18ff2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr20kf2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2016/owgr21ff2016.pdf"
)
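As an alternative to the gsub()/strsplit() combination above, here is a minimal sketch using utils::strcapture() (available in R 3.4.0 and later), which returns typed columns directly. The pattern and the names `res` and `proto` columns are my own choices for illustration, and it assumes every file name has the form owgr<week><letters><year>.pdf, as in the example data.

```r
# Subset of the URLs from the question
v <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr16fl2015.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf"
)

# proto is a template data frame: it supplies the name and type
# of the column created for each capture group, in order.
res <- strcapture(
  "owgr(\\d+)[a-z]+(\\d{4})\\.pdf$",  # group 1 = week, group 2 = year
  basename(v),                        # match against the file name only
  proto = data.frame(week = integer(), year = integer())
)

# Reorder the columns to match the desired output
res <- res[c("year", "week")]
res
```

Because the column types come from `proto`, both columns are integers here, whereas the gsub()/strsplit() version yields character (or factor) columns unless they are converted afterwards.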
With the PDF naming scheme in your example data, you can use basename() along with stringr::str_extract_all() to extract sequences of digits:
stringr::str_extract_all(basename(list), "\\d+", simplify = TRUE)
[,1] [,2]
[1,] "15" "2015"
[2,] "16" "2015"
[3,] "17" "2015"
[4,] "18" "2015"
[5,] "19" "2014"
[6,] "20" "2015"
[7,] "21" "2016"
Or, in a dataframe:
df1 <- data.frame(links = list)
df1[c("week", "year")] <- stringr::str_extract_all(basename(df1$links), "\\d+", simplify = TRUE)
df1
links week year
1 http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf 15 2015
2 http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr16fl2015.pdf 16 2015
3 http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr17fj2015.pdf 17 2015
4 http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr18ff2015.pdf 18 2015
5 http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf 19 2014
6 http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr20kf2015.pdf 20 2015
7 http://dps.endavadigital.net/owgr/doc/content/archive/2016/owgr21ff2016.pdf 21 2016
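As for why the str_match() attempt in the question fails: `[a-z][a-z][a-z]+` requires at least three letters between the two numbers, but most of the file names have only two (fl, fj, ff, kf), so those rows come back as NA. The column names are also swapped, since the first capture group is the week, not the year. A minimal sketch of a corrected version follows, assuming stringr is installed; the variable name `links` and the tightened pattern are my own choices.

```r
library(stringr)  # assumed to be installed

# Subset of the URLs from the question
links <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr15fhg2015.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2015/owgr16fl2015.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2014/owgr19ff2014.pdf"
)

# [a-z]+ accepts one or more letters, so two-letter codes match too.
# Columns of m: full match, week digits, year digits.
m <- str_match(links, "owgr(\\d+)[a-z]+(\\d{4})")

# Assign the groups to the right names and convert to integer
split_links <- data.frame(
  year = as.integer(m[, 3]),
  week = as.integer(m[, 2])
)
split_links
```

The "owgr" directory earlier in the URL does not interfere, because it is followed by a slash rather than digits, so the match only succeeds on the file name.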