Parsing through a string, counting and removing infrequent characters in R

So, I have a file like this:

"1" "4" "10"     "11111111111111"
"2" "10" "22"    "11111111110111"
"3" "10" "295"   "11111111111100"
"4" "10" "584"   "11010000000000"
"5" "10" "403"   "11111111110111"
"6" "10" "281"   "11110101011110"
"7" "10" "123"   "11100000100100"
"8" "10" "127"   "11111111111111"
"9" "10"  "79"   "11110111111111"
"10" "10" "1030" "11110000000100"
................................

I want to remove every row whose string in column 4 contains fewer than four '1' characters. For example, row 4 contains the string "11010000000000". That string has only three '1's, so I want to remove this row from the file.

PS: One way I thought of was to split the string into individual characters, put them in separate columns for each row, and then remove the rows having fewer than four '1's. Is there a more direct way to do it?

The better string-processing packages (like "stringi") often have a function to count the occurrences of a pattern:

library(stringi)
stri_count_fixed(str=mydf$V4, pattern="1")
#  [1] 14 13 12  3 13 10  5 14 13  5

But you can also do this in base R:

vapply(regmatches(mydf$V4, gregexpr("1", mydf$V4)), length, 1L)
#  [1] 14 13 12  3 13 10  5 14 13  5

You can then use that vector of counts to subset the data frame with a basic comparison operator.
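As a minimal sketch of that subsetting step: the data frame below is rebuilt by hand from the ten rows shown in the question, and the names `mydf` and `V1` to `V4` follow the answer's usage (how the asker actually loaded the file is not shown, so that part is an assumption). The fourth column is kept as character so the digits are not converted to a number.

```r
# Sample data copied from the question; the construction itself is an assumption.
mydf <- data.frame(
  V1 = 1:10,
  V2 = c(4L, rep(10L, 9)),
  V3 = c(10L, 22L, 295L, 584L, 403L, 281L, 123L, 127L, 79L, 1030L),
  V4 = c("11111111111111", "11111111110111", "11111111111100",
         "11010000000000", "11111111110111", "11110101011110",
         "11100000100100", "11111111111111", "11110111111111",
         "11110000000100"),
  stringsAsFactors = FALSE
)

# Count the "1" characters in each string (the base R version from above).
ones <- vapply(regmatches(mydf$V4, gregexpr("1", mydf$V4)), length, 1L)

# Keep only the rows with at least four "1"s; row 4 has three and is dropped.
kept <- mydf[ones >= 4, ]
```

The same comparison works with the stringi count, since both produce one integer per row.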

Remove the zeros and keep those rows for which what is left has at least 4 characters. Here we assume that the data is in a data frame named DF and that its fourth column is named V4:

subset(DF, nchar(gsub("0", "", V4)) >= 4)

or, a bit faster but perhaps slightly less readable:

DF[nchar(gsub("0", "", DF$V4)) >= 4, ]

Added: second variation.
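Note that deleting the zeros only counts the '1's correctly when the strings contain nothing but 0 and 1, as they do in the question. If other characters could appear, one possible variant is to delete everything that is not a '1' instead. The helper name `count_ones` and the test strings below are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: count "1" characters by deleting every other character.
count_ones <- function(x) nchar(gsub("[^1]", "", x))

# Two strings from the question plus two made-up edge cases.
x <- c("11010000000000", "11110000000100", "1a1b1c1", "")
counts <- count_ones(x)
counts
# [1] 3 5 4 0

# Same row filter as above, using the helper.
DF <- data.frame(V4 = x, stringsAsFactors = FALSE)
filtered <- DF[count_ones(DF$V4) >= 4, , drop = FALSE]
```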

This is another option in base R using regular expressions:

min.num <- 4
d[grepl(paste(rep(1, min.num), collapse='.*'), d[[4]]), ]

#    V1 V2   V3             V4
# 1   1  4   10 11111111111111
# 2   2 10   22 11111111110111
# 3   3 10  295 11111111111100
# 5   5 10  403 11111111110111
# 6   6 10  281 11110101011110
# 7   7 10  123 11100000100100
# 8   8 10  127 11111111111111
# 9   9 10   79 11110111111111
# 10 10 10 1030 11110000000100
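To see why this works, it may help to look at the pattern that `paste()` builds: `min.num` copies of "1" joined by ".*", which matches any string containing at least that many '1's with anything in between. A small sketch, where the name `pattern` is introduced only for illustration:

```r
min.num <- 4

# Build the regular expression: four "1"s separated by ".*" (any characters).
pattern <- paste(rep(1, min.num), collapse = ".*")
pattern
# [1] "1.*1.*1.*1"

# Row 4 of the question has three "1"s, row 10 has five.
hits <- grepl(pattern, c("11010000000000", "11110000000100"))
hits
# [1] FALSE  TRUE
```

Changing `min.num` changes the threshold without touching the rest of the expression.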

Benchmarking results:

library(microbenchmark)
library(stringi)

grothendieck <- function(DF) subset(DF, nchar(gsub("0", "", V4)) >= 4)
plourde <- function(d) d[grepl(paste(rep(1, 4), collapse='.*'), d[[4]]), ]
mahto1 <- function(mydf) mydf[stri_count_fixed(str=mydf$V4, pattern="1") > 3, ]
mahto2 <- function(mydf) mydf[vapply(regmatches(mydf$V4, gregexpr("1", mydf$V4)), length, 1L) > 3, ]

microbenchmark(grothendieck(d), plourde(d), mahto1(d), mahto2(d))

# Unit: microseconds
#             expr     min      lq  median      uq   max neval
#  grothendieck(d)  2895.7  2979.9  3003.8  3043.3  3444   100
#       plourde(d)  1280.2  1299.5  1317.6  1341.1  1542   100
#        mahto1(d)   518.2   532.3   545.8   554.5  1269   100
#        mahto2(d) 25465.3 27409.6 28447.0 29858.6 45734   100
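Timings are only comparable if the functions select the same rows. The sketch below checks that on a small stand-in data frame; the `d` used for the timings above is not shown in the post, so this four-row `d` (taken from rows 1, 4, 7 and 10 of the question) is an assumption. The stringi version is included only if that package is installed.

```r
# Stand-in data: rows 1, 4, 7 and 10 of the question (the benchmarked `d` is not shown).
d <- data.frame(V1 = c(1L, 4L, 7L, 10L), V2 = c(4L, 10L, 10L, 10L),
                V3 = c(10L, 584L, 123L, 1030L),
                V4 = c("11111111111111", "11010000000000",
                       "11100000100100", "11110000000100"),
                stringsAsFactors = FALSE)

grothendieck <- function(DF) subset(DF, nchar(gsub("0", "", V4)) >= 4)
plourde <- function(d) d[grepl(paste(rep(1, 4), collapse = ".*"), d[[4]]), ]
mahto2 <- function(mydf) mydf[vapply(regmatches(mydf$V4, gregexpr("1", mydf$V4)), length, 1L) > 3, ]

results <- list(grothendieck(d), plourde(d), mahto2(d))
if (requireNamespace("stringi", quietly = TRUE)) {
  mahto1 <- function(mydf) mydf[stringi::stri_count_fixed(mydf$V4, "1") > 3, ]
  results <- c(results, list(mahto1(d)))
}

# Compare which rows each function kept, identified by the V1 column.
kept_ids <- lapply(results, function(r) r$V1)
all_agree <- all(vapply(kept_ids, identical, logical(1), kept_ids[[1]]))
```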
