Using grep to filter urls

Example dataframe:

id url                                              ...                                           
1  www.hello.com/art/dance/article/title1/nothing
2  www.hello.com/dance/nothing
3  www.hello.com/art/dance/article/title2/nothing
4  www.hello.com/art/dance/article/title3/something
5  www.hello.com/art/dance/
6  www.hello.com/art/article/title4/nothing
7  www.hello.com/art/dance/article/title2/nothing
8  www.hello.com/art/dance/article/title3/something
...

I'm using grep to filter rows that contain a title in the url. The idea is to label certain urls. I'm running this with multiple titles.

clickstream[grep('.+/TITLE-IM-LOOKING-FOR/.+', clickstream$url, value = FALSE, perl = TRUE), ]$label <- "ChoosenLabel"

Is there a better way to filter and label the urls? Is grep always the best option?

Output

id url                                                                 Label                                          
1  www.hello.com/art/dance/article/title1/nothing
2  www.hello.com/dance/nothing
3  www.hello.com/art/dance/article/title2/nothing
4  www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
5  www.hello.com/art/dance/
6  www.hello.com/art/article/title4/nothing
7  www.hello.com/art/dance/article/title2/nothing
8  www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel

Update: I found that removing the leading and trailing .+ from the pattern makes this dramatically faster.
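
A minimal sketch of why dropping the wildcards is safe here, assuming the title always sits between two slashes with text on both sides. The vector `urls` is a hypothetical stand-in for clickstream$url, and `fixed = TRUE` is an extra step not mentioned in the question: once the pattern has no wildcards it is a plain substring, so literal matching can be used and is usually faster still.

```r
# Sample URLs from the question (hypothetical stand-in for clickstream$url)
urls <- c(
  "www.hello.com/art/dance/article/title1/nothing",
  "www.hello.com/dance/nothing",
  "www.hello.com/art/dance/article/title2/nothing",
  "www.hello.com/art/dance/article/title3/something",
  "www.hello.com/art/dance/",
  "www.hello.com/art/article/title4/nothing",
  "www.hello.com/art/dance/article/title2/nothing",
  "www.hello.com/art/dance/article/title3/something"
)

# Original style: the regex must also match the text around the title
with_wildcards <- grepl(".+/title3/.+", urls, perl = TRUE)

# Without the wildcards the pattern is a literal substring,
# so fixed = TRUE can be used instead of a regular expression
plain_substring <- grepl("/title3/", urls, fixed = TRUE)

# Both approaches flag the same rows on this sample
identical(with_wildcards, plain_substring)
which(plain_substring)
```

The two patterns are not equivalent in general: ".+/title3/.+" requires at least one character on each side of the title, so a URL that ends in "/title3/" would match only the shorter pattern.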

Doing this in base R:

transform(dat,Label=ifelse(grepl("title3",url),"title3",""))

  id                                              url  Label
1  1   www.hello.com/art/dance/article/title1/nothing       
2  2                      www.hello.com/dance/nothing       
3  3   www.hello.com/art/dance/article/title2/nothing       
4  4 www.hello.com/art/dance/article/title3/something title3
5  5                         www.hello.com/art/dance/       
6  6         www.hello.com/art/article/title4/nothing       
7  7   www.hello.com/art/dance/article/title2/nothing       
8  8 www.hello.com/art/dance/article/title3/something title3
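
Since the question mentions running this with multiple titles, the same idea could be extended with a lookup vector. This is only a sketch: the names `labels`, "LabelA" and "LabelB" are made up for illustration, and `dat` is rebuilt here from the sample data so the block runs on its own.

```r
# Sample data from the question
dat <- data.frame(
  id = 1:8,
  url = c(
    "www.hello.com/art/dance/article/title1/nothing",
    "www.hello.com/dance/nothing",
    "www.hello.com/art/dance/article/title2/nothing",
    "www.hello.com/art/dance/article/title3/something",
    "www.hello.com/art/dance/",
    "www.hello.com/art/article/title4/nothing",
    "www.hello.com/art/dance/article/title2/nothing",
    "www.hello.com/art/dance/article/title3/something"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: names are the titles to search for, values are the labels
labels <- c(title1 = "LabelA", title3 = "LabelB")

# Start with an empty label, then fill in one title at a time
dat$Label <- ""
for (title in names(labels)) {
  # Wrap the title in slashes so it only matches a whole path segment
  hit <- grepl(paste0("/", title, "/"), dat$url, fixed = TRUE)
  dat$Label[hit] <- labels[[title]]
}

# Show only the labelled rows
dat[dat$Label != "", c("id", "Label")]
```

If a URL contained more than one of the titles, the last matching entry in `labels` would win, so the order of the lookup vector matters in that case.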

One option is to use grep with value = TRUE, which returns the matching URLs themselves. Say you are looking for 'dance'; then try:

> grep(".+/dance/.+", df$url, value = TRUE)
[1] "www.hello.com/art/dance/article/title1/nothing"  
[2] "www.hello.com/dance/nothing"                     
[3] "www.hello.com/art/dance/article/title2/nothing"  
[4] "www.hello.com/art/dance/article/title3/something"
[5] "www.hello.com/art/dance/article/title2/nothing"  
[6] "www.hello.com/art/dance/article/title3/something"

Another example could be:

> grep(".+/title3/.+", df$url, value = TRUE)
[1] "www.hello.com/art/dance/article/title3/something"
[2] "www.hello.com/art/dance/article/title3/something"
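
Note that value = TRUE gives back the strings, which is handy for inspection but not directly usable for labelling rows. A short sketch of the three return shapes, assuming the same sample URLs in a hypothetical vector `urls`:

```r
# Sample URLs from the question (hypothetical stand-in for df$url)
urls <- c(
  "www.hello.com/art/dance/article/title1/nothing",
  "www.hello.com/dance/nothing",
  "www.hello.com/art/dance/article/title2/nothing",
  "www.hello.com/art/dance/article/title3/something",
  "www.hello.com/art/dance/",
  "www.hello.com/art/article/title4/nothing",
  "www.hello.com/art/dance/article/title2/nothing",
  "www.hello.com/art/dance/article/title3/something"
)

pattern <- ".+/dance/.+"

# grep() with the default value = FALSE: integer positions of the matches
idx <- grep(pattern, urls)

# grep() with value = TRUE: the matching strings themselves
vals <- grep(pattern, urls, value = TRUE)

# grepl(): one TRUE/FALSE per element, convenient for ifelse() or subsetting
flag <- grepl(pattern, urls)

idx
identical(vals, urls[idx])
sum(flag)
```

For adding a label column, the positions or the logical vector are the useful forms, because they line up with the rows of the data frame.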

Option 1 using dplyr:

# Create data
clickstream <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id url                                                                           
1  www.hello.com/art/dance/article/title1/nothing
2  www.hello.com/dance/nothing
3  www.hello.com/art/dance/article/title2/nothing
4  www.hello.com/art/dance/article/title3/something
5  www.hello.com/art/dance/
6  www.hello.com/art/article/title4/nothing
7  www.hello.com/art/dance/article/title2/nothing
8  www.hello.com/art/dance/article/title3/something")

# Your pattern
# match the title as a whole path segment; no wildcards are needed
regex <- "/title3/"
replacement <- "/TITLE-IM-LOOKING-FOR/"

# computation
library(dplyr)
clickstream %>%
  mutate(label = if_else(grepl(regex, url), "ChoosenLabel", "")) %>%
  mutate(url = if_else(label != "", gsub(regex, replacement, url), url))

output:

  id                                                            url        label
1  1                 www.hello.com/art/dance/article/title1/nothing
2  2                                    www.hello.com/dance/nothing
3  3                 www.hello.com/art/dance/article/title2/nothing
4  4 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
5  5                                       www.hello.com/art/dance/
6  6                       www.hello.com/art/article/title4/nothing
7  7                 www.hello.com/art/dance/article/title2/nothing
8  8 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel

Option 2 using data.table (same output):

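library(data.table)
dt <- setDT(clickstream)  # converts clickstream to a data.table by reference
# fifelse() is data.table's own vectorised if/else, so dplyr is not required
dt[, label := fifelse(grepl(regex, url), "ChoosenLabel", "")]
dt[label != "", url := gsub(regex, replacement, url)]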
