Example dataframe:
id url ...
1 www.hello.com/art/dance/article/title1/nothing
2 www.hello.com/dance/nothing
3 www.hello.com/art/dance/article/title2/nothing
4 www.hello.com/art/dance/article/title3/something
5 www.hello.com/art/dance/
6 www.hello.com/art/article/title4/nothing
7 www.hello.com/art/dance/article/title2/nothing
8 www.hello.com/art/dance/article/title3/something
...
I'm using grep to filter rows that contain a title in the url. The idea is to label certain urls. I'm running this with multiple titles.
clickstream[grep('.+/TITLE-IM-LOOKING-FOR/.+', clickstream$url, value = FALSE, perl = TRUE), ]$label <- "ChoosenLabel"
Is there a better way to filter and label the urls? Is grep always the best option?
Desired output:
id url Label
1 www.hello.com/art/dance/article/title1/nothing
2 www.hello.com/dance/nothing
3 www.hello.com/art/dance/article/title2/nothing
4 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
5 www.hello.com/art/dance/
6 www.hello.com/art/article/title4/nothing
7 www.hello.com/art/dance/article/title2/nothing
8 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
Update: I found that removing the leading and trailing .+ from the pattern speeds this up dramatically.
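To illustrate the update, here is a minimal sketch comparing the wildcard pattern with a plain substring match. The urls vector is copied from the example data, and title3 stands in for the title being searched. The two patterns are not strictly equivalent in general (the wildcard version needs at least one character on each side of the match), but they select the same rows for this data. No timings are claimed here.

```r
# Example URLs copied from the data frame above
urls <- c(
  "www.hello.com/art/dance/article/title1/nothing",
  "www.hello.com/dance/nothing",
  "www.hello.com/art/dance/article/title3/something",
  "www.hello.com/art/dance/"
)

# Original style: the greedy .+ consumes the string and then has to backtrack
with_wildcards <- grepl(".+/title3/.+", urls, perl = TRUE)

# Same substring without wildcards; fixed = TRUE treats the pattern as a
# literal string instead of a regular expression
without_wildcards <- grepl("/title3/", urls, fixed = TRUE)

# Both approaches flag the same rows for this data
identical(with_wildcards, without_wildcards)
```

If the title can contain regex metacharacters such as . or ?, fixed = TRUE also avoids having to escape them.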
Doing this in base R:
transform(dat,Label=ifelse(grepl("title3",url),"title3",""))
id url Label
1 1 www.hello.com/art/dance/article/title1/nothing
2 2 www.hello.com/dance/nothing
3 3 www.hello.com/art/dance/article/title2/nothing
4 4 www.hello.com/art/dance/article/title3/something title3
5 5 www.hello.com/art/dance/
6 6 www.hello.com/art/article/title4/nothing
7 7 www.hello.com/art/dance/article/title2/nothing
8 8 www.hello.com/art/dance/article/title3/something title3
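Since the question mentions running this for multiple titles, the same idea can be extended with a lookup vector. This is only a sketch: the labels vector and the names LabelA and LabelB are hypothetical, and dat is a cut-down copy of the example data.

```r
# Cut-down copy of the example data
dat <- data.frame(
  id  = 1:4,
  url = c(
    "www.hello.com/art/dance/article/title1/nothing",
    "www.hello.com/dance/nothing",
    "www.hello.com/art/dance/article/title2/nothing",
    "www.hello.com/art/dance/article/title3/something"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: names are the title segments, values are the labels
labels <- c(title1 = "LabelA", title3 = "LabelB")

# Start with an empty label, then fill in one title at a time
dat$Label <- ""
for (title in names(labels)) {
  # Literal match on "/<title>/" so that only whole path segments match
  hit <- grepl(paste0("/", title, "/"), dat$url, fixed = TRUE)
  dat$Label[hit] <- labels[[title]]
}

dat$Label
```

If a URL matched more than one title, the last matching entry in labels would win with this loop.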
One option is to use grep with value = TRUE to get the matching values. Say you are looking for 'dance'; then just try:
> grep(".+/dance/.+", df$url, value = TRUE)
[1] "www.hello.com/art/dance/article/title1/nothing"
[2] "www.hello.com/dance/nothing"
[3] "www.hello.com/art/dance/article/title2/nothing"
[4] "www.hello.com/art/dance/article/title3/something"
[5] "www.hello.com/art/dance/article/title2/nothing"
[6] "www.hello.com/art/dance/article/title3/something"
Another example could be:
> grep(".+/title3/.+", df$url, value = TRUE)
[1] "www.hello.com/art/dance/article/title3/something"
[2] "www.hello.com/art/dance/article/title3/something"
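For labelling rows rather than extracting values, it may help to compare the three return forms. A small sketch, using a two-element urls vector taken from the example data:

```r
urls <- c(
  "www.hello.com/art/dance/article/title3/something",
  "www.hello.com/dance/nothing"
)

# Integer positions of the matching elements
idx <- grep("/title3/", urls)

# The matching strings themselves
vals <- grep("/title3/", urls, value = TRUE)

# Logical vector with one element per input, convenient for row subsetting
flags <- grepl("/title3/", urls)
```

The logical form from grepl has the same length as the input, so it can be used directly inside ifelse or as a row index.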
Option 1 using dplyr:
# Create data
clickstream <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id url
1 www.hello.com/art/dance/article/title1/nothing
2 www.hello.com/dance/nothing
3 www.hello.com/art/dance/article/title2/nothing
4 www.hello.com/art/dance/article/title3/something
5 www.hello.com/art/dance/
6 www.hello.com/art/article/title4/nothing
7 www.hello.com/art/dance/article/title2/nothing
8 www.hello.com/art/dance/article/title3/something")
# Your pattern
# Match the title segment only; no wildcards are needed
regex <- "/title3/"
replacement <- "/TITLE-IM-LOOKING-FOR/"
# computation
library(dplyr)
clickstream %>%
  # Label the rows whose url contains the pattern
  mutate(label = if_else(grepl(regex, url), "ChoosenLabel", "")) %>%
  # Rewrite the url only for the labelled rows
  mutate(url = if_else(label != "", gsub(regex, replacement, url), url))
Output:
id url label
1 1 www.hello.com/art/dance/article/title1/nothing
2 2 www.hello.com/dance/nothing
3 3 www.hello.com/art/dance/article/title2/nothing
4 4 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
5 5 www.hello.com/art/dance/
6 6 www.hello.com/art/article/title4/nothing
7 7 www.hello.com/art/dance/article/title2/nothing
8 8 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
Option 2 using data.table (same output):
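library(data.table)
# setDT() converts clickstream to a data.table by reference (no copy)
dt <- setDT(clickstream)
# fifelse() is data.table's own vectorised if/else, so dplyr is not required here
dt[, label := fifelse(grepl(regex, url), "ChoosenLabel", "")]
# Rewrite the url only for the labelled rows
dt[label != "", url := gsub(regex, replacement, url)]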