
Extract substring in R from string with fixed start position and end point as a character found

I want to do the following extraction in R.

I have a column that contains links like this one: http://www.imdb.com/title/tt2569314/companycredits

I want to extract the tt2569314 out of this and store it in a new column.

The way I want to do it is to take a substring of the column where the start position is fixed at LEN( http://www.imdb.com/ ) and the end position is dynamic, based on where the first '/' is found after the start position.

I want this to be kind of a mixture of SUBSTR and INSTR in SQL.

Please advise.

You could try this:

a <- "http://www.imdb.com/title/tt2569314/companycredits"
# Escape the dots so they match literal "." and capture the second path segment
sub("http://www\\.imdb\\.com/.+/(.+)/.+", "\\1", a)
#[1] "tt2569314"

If all the links share the same path structure, you can use dirname() to drop the last path segment and then strip everything up to the final slash:

x <- "http://www.imdb.com/title/tt2569314/companycredits"
sub("(.*)[/]", "", dirname(x))
# [1] "tt2569314"

Or you can paste together a regular expression from the base URL and keep the second captured group:

y <- "http://www.imdb.com"
sub(paste0(y, "[/](.*)[/](.*)[/](.*)"), "\\2", x)
# [1] "tt2569314"

Or you may even be able to get away with this:

basename(dirname(x))
# [1] "tt2569314"
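
Since the goal is to store the result in a new column, here is a minimal sketch of applying the same idea to a whole column. basename() and dirname() are vectorized, so no loop should be needed. The data frame, the column names `link` and `title_id`, and the second URL are assumptions made up for illustration.

```r
# Hypothetical data frame; `link` and `title_id` are illustrative names
df <- data.frame(
  link = c("http://www.imdb.com/title/tt2569314/companycredits",
           "http://www.imdb.com/title/tt0111161/fullcredits"),
  stringsAsFactors = FALSE
)

# dirname() drops the last path segment, basename() keeps what is now the last one
df$title_id <- basename(dirname(df$link))
df$title_id
```

This only works when every link has exactly one segment after the ID, as in the example URL.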

It's a bit more drawn out if you want to use substr() with explicit positions, but stringr has a couple of helpful functions for locating them.

library(stringr)
s1 <- str_locate_all(x, "[/]")[[1]]
s2 <- str_locate(x, "http://www.imdb.com/title")
m <- match(s2[,2]+1, s1[,1])
substr(x, s1[m,1]+1, s1[m+1,1]-1)
# [1] "tt2569314"
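
If you would rather stay in base R, the following is a rough analogue of combining SUBSTR and INSTR: nchar() gives the fixed start position and regexpr() with fixed = TRUE finds the next "/". It is only a sketch, and it assumes every link starts with the same prefix; the variable names `prefix`, `rest`, `slash` and `id` are just illustrative.

```r
x <- "http://www.imdb.com/title/tt2569314/companycredits"
prefix <- "http://www.imdb.com/title/"  # assumed to be constant across all links

start <- nchar(prefix) + 1              # position of the first character of the ID
rest  <- substring(x, start)            # everything after the fixed prefix

# Position of the first "/" in the remainder (like INSTR); -1 when there is none.
# as.integer() drops the extra attributes that regexpr() attaches.
slash <- as.integer(regexpr("/", rest, fixed = TRUE))

# Take the text before that slash, or the whole remainder if no slash follows
id <- ifelse(slash > 0, substr(rest, 1, slash - 1), rest)
id
```

All of these functions are vectorized, so x can also be a whole column.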

You could try:

 str1 <- "http://www.imdb.com/title/tt2569314/companycredits"
 library(httr)
 gsub("^[^/]*\\/|\\/[^/]*", "", parse_url(str1)$path)
 #[1] "tt2569314"

You may also try this, which uses \\K to discard everything matched before the ID:

x <- "http://www.imdb.com/title/tt2569314/companycredits"
m <- regexpr("^http://www\\.imdb\\.com/[^/]*/\\K[^/]+", x, perl=TRUE)
regmatches(x, m)
# [1] "tt2569314"
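
One thing to watch when writing the result into a column: regmatches() returns only the matched elements, so the output can be shorter than the input if some rows do not match. A minimal sketch of one way to keep the lengths aligned, filling non-matches with NA, might look like this. The second URL is a made-up non-matching example.

```r
x <- c("http://www.imdb.com/title/tt2569314/companycredits",
       "http://example.com/other")      # hypothetical link that does not match

m <- regexpr("^http://www\\.imdb\\.com/[^/]*/\\K[^/]+", x, perl = TRUE)

# regexpr() returns -1 for elements without a match
hit <- as.integer(m) != -1L

# Start with NA everywhere, then fill in the rows that matched
id <- rep(NA_character_, length(x))
id[hit] <- regmatches(x, m)
id
```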


 