Using regular expressions in R to categorize data

I have a file with two columns: one holds the content type of HTTP objects, such as text/html or application/rar, and the other holds the size in bytes.

Content Type                                     Size
video/x-flv                                       100
image/jpeg                                        150
text/html                                         160
application/octet-stream                          200  
application/x-shockwave-flash                     ...
text/plain
application/x-javascript
text/xml
text/css
text/html; charset=utf-8
application/x-javascript; charset=utf-8           ...

As you can see, there are many variations of the same content type, such as application/x-javascript and application/x-javascript; charset=utf-8. So I would like to create another column that categorizes them more generically, so that these two would both be web/javascript, and so on.

    Content Type                                      Size      Category
    video/x-flv                                       100       web/video
    image/jpeg                                        150       web/image
    text/html                                         160       web/html
    application/octet-stream                          200       web/binary
    application/x-shockwave-flash                     ...       web/flash
    text/plain                                                  web/plaintext
    application/x-javascript                                    web/javascript
    video/x-msvideo                                             web/video
    text/xml                                                    web/xml
    text/css                                                    web/css
    text/html; charset=utf-8                                    web/html
    video/quicktime                                             web/video
    application/x-javascript; charset=utf-8                     web/javascript

How would I accomplish this in R? I presume I need to use regular expressions of some sort.

There are several ways you can simplify your variable. Here I will use the stringr package for its string manipulation functions:

R> library(stringr)

First, copy your content type variable into a new character variable:

R> d <- data.frame(type=c("video/x-flv", "image/jpeg","video/x-msvideo", "application/x-javascript; charset=utf-8", "application/x-javascript"))
R> d$type2 <- as.character(d$type)

Which just gives you:

                                     type                                   type2
1                             video/x-flv                             video/x-flv
2                              image/jpeg                              image/jpeg
3                         video/x-msvideo                         video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5                application/x-javascript                application/x-javascript

Then you can work on your new variable. You can manually replace a specific type value with another:

R> d$type2[d$type2 == "video/x-flv"] <- "video"
R> d
                                     type                                   type2
1                             video/x-flv                                   video
2                              image/jpeg                              image/jpeg
3                         video/x-msvideo                         video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5                application/x-javascript                application/x-javascript
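
When many exact values need recoding, repeating the line above for each one gets tedious. A minimal sketch of a lookup-table alternative in base R follows. The `lookup` vector and the `category` column are illustrative assumptions (the category names come from the question's desired output), not part of the original answer.

```r
# Sample data, as in the example above
d <- data.frame(type = c("video/x-flv", "image/jpeg", "video/x-msvideo",
                         "application/x-javascript; charset=utf-8",
                         "application/x-javascript"),
                stringsAsFactors = FALSE)

# Hypothetical lookup table: names are exact content types, values are categories
lookup <- c("video/x-flv"              = "web/video",
            "video/x-msvideo"          = "web/video",
            "image/jpeg"               = "web/image",
            "application/x-javascript" = "web/javascript")

# Index the lookup by name; content types that are not listed come back as NA
d$category <- unname(lookup[d$type])
d
```

Note that the "; charset=utf-8" variant is not an exact match, so it is left as NA here. Exact lookups alone do not handle such variations.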

You can use regexp matching to replace all the values that match a pattern, for example "video":

R> d$type2[str_detect(d$type2, ".*video.*")] <- "video"
R> d
                                     type                                   type2
1                             video/x-flv                                   video
2                              image/jpeg                              image/jpeg
3                         video/x-msvideo                                   video
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5                application/x-javascript                application/x-javascript

Or you can use regexp replacement to clean up certain values, for example by removing everything from the ";" onward in your content types:

R> d$type2 <- str_replace(d$type2, ";.*$", "")
R> d
                                     type                    type2
1                             video/x-flv                    video
2                              image/jpeg               image/jpeg
3                         video/x-msvideo                    video
4 application/x-javascript; charset=utf-8 application/x-javascript
5                application/x-javascript application/x-javascript

Be careful about the order of these steps, though: each one modifies type2 in place, so the result depends on what the earlier steps have already changed.
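
To illustrate why the order matters: an exact comparison against "application/x-javascript" only catches the "; charset=utf-8" variant if the suffix has been stripped first. The sketch below uses base R's sub() in place of str_replace() so that it runs without extra packages. The names `types`, `a` and `b` are assumptions made for the example.

```r
types <- c("application/x-javascript; charset=utf-8",
           "application/x-javascript")

# Order 1: exact match first, then strip the suffix.
# The charset variant is not recoded, because it did not match exactly.
a <- types
a[a == "application/x-javascript"] <- "web/javascript"
a <- sub(";.*$", "", a)
a

# Order 2: strip the suffix first, then match exactly.
# Both entries are now recoded.
b <- sub(";.*$", "", types)
b[b == "application/x-javascript"] <- "web/javascript"
b
```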

If you had to do it by hand, you could assign your factor levels to the corresponding categories. In this example, I group the first 13 letters of the alphabet as "1" and the last 13 as "2".

> x <- as.factor(sample(letters, 100, replace = TRUE))
> x
  [1] d n p n k l a x c n v p l o u e z m y x t r q b l n y s s m d u l l a d k
 [38] t a p x s g w i p l b s o t b s h h v c b j o p h f j m v d r m x o d l e
 [75] l f y l u e w f e e o s w s m v a z q l a t f z x s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> levels(x)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> levels(x) <- c(rep(1, 13), rep(2, 13))
> x
  [1] 1 2 2 2 1 1 1 2 1 2 2 2 1 2 2 1 2 1 2 2 2 2 2 1 1 2 2 2 2 1 1 2 1 1 1 1 1
 [38] 2 1 2 2 2 1 2 1 2 1 1 2 2 2 1 2 1 1 2 1 1 1 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1
 [75] 1 1 2 1 2 1 2 1 1 1 2 2 2 2 1 2 1 2 2 1 1 2 1 2 2 2
Levels: 1 2
> levels(x)
[1] "1" "2"

If your variable is a factor with (only) these levels, i.e.:

"video/x-flv" "image/jpeg" "video/x-msvideo" "application/x-javascript; charset=utf-8"

... you would recode your levels like so, listing the new labels in the same order that levels(obj) returns:

levels(obj) <- c("web/video", "web/image", "web/video", "web/javascript")
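
Because the positional form above depends on the order that levels(obj) returns (sorted by default, not in order of appearance), mapping by name can be less error-prone. A minimal sketch, assuming `obj` is a factor built from the four strings above. It uses the list form of `levels<-`, where each list name is a new level and its value holds the old levels to merge into it.

```r
obj <- factor(c("video/x-flv", "image/jpeg", "video/x-msvideo",
                "application/x-javascript; charset=utf-8"))

# Default level order is sorted, so a positional assignment must follow this order
levels(obj)

# Map old levels to new ones by name; the position in levels(obj) no longer matters
levels(obj) <- list("web/video"      = c("video/x-flv", "video/x-msvideo"),
                    "web/image"      = "image/jpeg",
                    "web/javascript" = "application/x-javascript; charset=utf-8")
as.character(obj)
```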

Assume that DF is our data frame. Define a regular expression, re, to match the strings of interest, and then use strapply in the gsubfn package to extract them, prefixing "web/" to each. In the strapply statement we convert DF[[1]] to character in case it is a factor rather than a character vector. NULL entries are those that did not match, so let's assume those are "web/binary". Finally, expand any occurrences of "plain" to "plaintext":

> library(gsubfn)
> re <- "(video|image|html|flash|plain|javascript|xml|css).*"
> short <- strapply(as.character(DF[[1]]), re, ~ paste("web", x, sep = "/"))
> DF$short <- sapply(short, function(x) if (is.null(x)) "web/binary" else x)
> DF$short <- sub("plain", "plaintext", DF$short)
> DF
                                   Content          short
1                              video/x-flv      web/video
2                               image/jpeg      web/image
3                                text/html       web/html
4                 application/octet-stream     web/binary
5            application/x-shockwave-flash      web/flash
6                               text/plain  web/plaintext
7                 application/x-javascript web/javascript
8                          video/x-msvideo      web/video
9                                 text/xml        web/xml
10                                text/css        web/css
11                text/html; charset=utf-8       web/html
12                         video/quicktime      web/video
13 application/x-javascript; charset=utf-8 web/javascript

There is more info on the gsubfn package at http://gsubfn.googlecode.com.
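
If installing gsubfn is not an option, roughly the same result can be sketched in base R with regexpr() and regmatches(). This is an alternative technique rather than the method used above. The `content` vector is a hypothetical sample standing in for as.character(DF[[1]]).

```r
# Hypothetical sample; with the data frame above this would be as.character(DF[[1]])
content <- c("video/x-flv", "text/plain", "application/octet-stream",
             "application/x-javascript; charset=utf-8")

re <- "video|image|html|flash|plain|javascript|xml|css"

# regexpr() gives the position of the first match, or -1 when there is none
m <- regexpr(re, content)
hit <- m != -1

# Default everything to "web/binary", then fill in the matched keywords;
# regmatches() returns one string per matched element, in order
short <- rep("web/binary", length(content))
short[hit] <- paste("web", regmatches(content, m), sep = "/")

# Expand "plain" to "plaintext", as in the answer above
short <- sub("plain$", "plaintext", short)
short
```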
