
Parsing through a string, counting and removing infrequent characters in R

I have a file like this:

"1" "4" "10"     "11111111111111"
"2" "10" "22"    "11111111110111"
"3" "10" "295"   "11111111111100"
"4" "10" "584"   "11010000000000"
"5" "10" "403"   "11111111110111"
"6" "10" "281"   "11110101011110"
"7" "10" "123"   "11100000100100"
"8" "10" "127"   "11111111111111"
"9" "10"  "79"   "11110111111111"
"10" "10" "1030" "11110000000100"
................................

I want to remove every row whose string in column 4 contains fewer than four '1' characters. For example, row 4 contains the string "11010000000000", which has only three '1' characters, so I want to remove that row from the file.

PS: One way I thought of was to split each string into individual characters, put them in separate columns, and then remove the rows that have fewer than four '1' characters. Is there a more direct way to do it?

String processing packages such as "stringi" often provide a function that counts the occurrences of a pattern:

library(stringi)
stri_count_fixed(str=mydf$V4, pattern="1")
#  [1] 14 13 12  3 13 10  5 14 13  5

But you can also do this in base R.

vapply(regmatches(mydf$V4, gregexpr("1", mydf$V4)), length, 1L)
#  [1] 14 13 12  3 13 10  5 14 13  5

You can then use that vector of counts to subset the data frame with a basic comparison operator, for example keeping the rows where the count is greater than 3.
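As a minimal sketch of that subsetting step in base R, assuming the file is read with read.table and the columns get the default names V1 to V4. The names txt, ones, and kept are only for illustration, and the text = argument stands in for the real file path, which the question does not give. Reading every column as character is an assumption made here so that the fourth column stays a string of digits rather than becoming a number.

```r
# The ten sample rows shown in the question; the real file has more rows.
txt <- '
"1" "4" "10"     "11111111111111"
"2" "10" "22"    "11111111110111"
"3" "10" "295"   "11111111111100"
"4" "10" "584"   "11010000000000"
"5" "10" "403"   "11111111110111"
"6" "10" "281"   "11110101011110"
"7" "10" "123"   "11100000100100"
"8" "10" "127"   "11111111111111"
"9" "10"  "79"   "11110111111111"
"10" "10" "1030" "11110000000100"
'

# colClasses = "character" keeps V4 as text (assumption: no header line).
mydf <- read.table(text = txt, header = FALSE, colClasses = "character")

# Count the "1" characters in each string of column V4.
ones <- lengths(regmatches(mydf$V4, gregexpr("1", mydf$V4, fixed = TRUE)))

# Keep only the rows with at least four "1" characters.
kept <- mydf[ones >= 4, ]
```

With these sample rows, ones matches the counts shown above and kept drops only row 4.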

Remove the zeros and keep the rows for which what is left has at least 4 characters. Here we assume that the data is in a data frame named DF and that its fourth column is named V4:

subset(DF, nchar(gsub("0", "", V4)) >= 4)

or a bit faster but perhaps slightly less readable:

DF[nchar(gsub("0", "", DF$V4)) >= 4, ]
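To see why removing the zeros works, here is a small worked example on two strings taken from the question (the vector name x is hypothetical). Stripping zeros counts every remaining character, so it relies on the strings containing only 0 and 1; the last line is a variant, not part of the original answer, that drops everything except "1" instead.

```r
# Two strings from the question: one with three 1s, one with five.
x <- c("11010000000000", "11100000100100")

# After removing the zeros only the 1s remain.
stripped <- gsub("0", "", x)        # "111" and "11111"

# The length of what is left is the number of 1s.
keep <- nchar(stripped) >= 4        # FALSE, TRUE

# Variant: remove every character that is not "1".
keep_strict <- nchar(gsub("[^1]", "", x)) >= 4   # FALSE, TRUE
```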


This is another option in base R using regular expressions:

min.num <- 4
d[grepl(paste(rep(1, min.num), collapse='.*'), d[[4]]), ]

#    V1 V2   V3             V4
# 1   1  4   10 11111111111111
# 2   2 10   22 11111111110111
# 3   3 10  295 11111111111100
# 5   5 10  403 11111111110111
# 6   6 10  281 11110101011110
# 7   7 10  123 11100000100100
# 8   8 10  127 11111111111111
# 9   9 10   79 11110111111111
# 10 10 10 1030 11110000000100
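To make the regular expression above easier to read, this short sketch prints the pattern that paste() builds and tries it on two strings from the question. The names pattern and hits are only for illustration.

```r
min.num <- 4

# rep(1, 4) pasted with ".*" between the elements gives "1.*1.*1.*1":
# four 1s with any number of other characters between them.
pattern <- paste(rep(1, min.num), collapse = ".*")

# A string matches only if it contains at least min.num 1s.
hits <- grepl(pattern, c("11010000000000", "11100000100100"))  # FALSE, TRUE
```

Because the pattern only asks for four 1s somewhere in the string, grepl can stop at the first match instead of counting every occurrence.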

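Benchmarking results for the approaches above (in this run the stringi version is fastest and the regmatches version is slowest):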

library(microbenchmark)
library(stringi)

grothendieck <- function(DF) subset(DF, nchar(gsub("0", "", V4)) >= 4)
plourde <- function(d) d[grepl(paste(rep(1, 4), collapse='.*'), d[[4]]), ]
mahto1 <- function(mydf) mydf[stri_count_fixed(str=mydf$V4, pattern="1") > 3, ]
mahto2 <- function(mydf) mydf[vapply(regmatches(mydf$V4, gregexpr("1", mydf$V4)), length, 1L) > 3, ]

microbenchmark(grothendieck(d), plourde(d), mahto1(d), mahto2(d))

# Unit: microseconds
#             expr     min      lq  median      uq   max neval
#  grothendieck(d)  2895.7  2979.9  3003.8  3043.3  3444   100
#       plourde(d)  1280.2  1299.5  1317.6  1341.1  1542   100
#        mahto1(d)   518.2   532.3   545.8   554.5  1269   100
#        mahto2(d) 25465.3 27409.6 28447.0 29858.6 45734   100

