Using gsub() to extract numbers from an array in R

I want to remove & and . from the following array and extract only the numbers:

x = as.factor(c(".&.", "0.0119885482338&.&.", ".&2.25880593895", ".&.&.&.&.&.&.&.", ".&0.295142083575&.", "0.708323350364",".&.&0.193766679861",".&.&.&.&7.65239874523E-4&.&."))

I tried the following gsub() command:

gsub("[^0-9.E-]","",x)

The output:

".."                     "0.0119885482338.."      ".2.25880593895"         
"........"              
".0.295142083575."       "0.708323350364"         "..0.193766679861"       
"....7.65239874523E-4.." 

How can I update the gsub() command above so that the output looks like this?

"" "0.0119885482338" "2.25880593895" "" "0.295142083575" 
"0.708323350364" "0.193766679861" "7.65239874523E-4"  

You can use

> sub("^.*?(?:([-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?).*|$)","\\1",x)
[1] ""                 "0.0119885482338"  "2.25880593895"    ""                 "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"

See the regex demo.

Details:

  • ^ - start of string
  • .*? - any text, as short as possible
  • (?: - start of a non-capturing group:
    • ([-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?) - Group 1 (\1): a number pattern
    • .* - the rest of the string
  • |
    • $ - end of string
  • ) - end of the non-capturing group.
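
To see how the pieces listed above interact, here is a small sketch that runs the same pattern on a few made-up strings (they are hypothetical, not from the question). It passes perl = TRUE, which is my assumption to make the Perl-style syntax explicit; the call in the answer does not set it.

```r
# Hypothetical test strings, not taken from the question
y <- c(".&.", ".&-3.5e+2&.", "42", "&.&1E-3")

pattern <- "^.*?(?:([-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?).*|$)"

# perl = TRUE is an assumption; the answer calls sub() without it.
# When no number is found, the "$" branch matches the whole string
# and Group 1 stays empty, so the result is "".
res <- sub(pattern, "\\1", y, perl = TRUE)
res
## [1] ""        "-3.5e+2" "42"      "1E-3"
```

The sign and the exponent are kept because both are part of Group 1, and only the first number in each string is returned.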

See an online R demo:

x=as.factor(c(".&.", "0.0119885482338&.&.", ".&2.25880593895", ".&.&.&.&.&.&.&.", ".&0.295142083575&.", "0.708323350364",".&.&0.193766679861",".&.&.&.&7.65239874523E-4&.&."))
sub("^.*?(?:([-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?).*|$)","\\1",x)
## => [1] ""                 "0.0119885482338"  "2.25880593895"    ""                
##    [5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"

Here is a base R approach using grepl followed by sub:

x <- x[grepl("\\d+", x)]
x <- sub("^.*?(\\d+(?:\\.\\d+)?(?:E[-+]\\d+)?).*$", "\\1", x)
x

[1] "0.0119885482338"  "2.25880593895"    "0.295142083575"   "0.708323350364"  
[5] "0.193766679861"   "7.65239874523E-4"
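
Note that filtering with grepl first drops the elements that contain no number, so the result is shorter than the input. If the positions need to be preserved, one possible variation (a sketch only, with the hypothetical names has_num and out, and perl = TRUE added as an assumption) is to fill a pre-allocated character vector:

```r
# Shortened sample of the question's data, as a plain character vector
x <- c(".&.", "0.0119885482338&.&.", ".&.&.&.&7.65239874523E-4&.&.")

has_num <- grepl("\\d", x)        # TRUE where the string contains a digit
out <- character(length(x))       # starts as all ""

# Extract the first number only from the elements that have one
out[has_num] <- sub("^.*?(\\d+(?:\\.\\d+)?(?:E[-+]\\d+)?).*$", "\\1",
                    x[has_num], perl = TRUE)
out
## [1] ""                 "0.0119885482338"  "7.65239874523E-4"
```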

In the alternatives below, remove as.numeric at the end if you want the result to be character.

1) The following does not use regular expressions. The input shown in the question consists of fields separated by &, so this converts x from factor to character, splits it into fields at each &, drops every field that is just a lone dot, and converts what remains to numeric. No packages are used.

s <- unlist(strsplit(paste(x), "&", fixed = TRUE))
as.numeric(s[s != "."])
## [1] 0.0119885482 2.2588059390 0.2951420836 0.7083233504 0.1937666799
## [6] 0.0007652399
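
The paste(x) call matters because x is a factor: strsplit only accepts character input, and calling as.numeric directly on a factor returns the internal level codes rather than the values. A quick sketch with a made-up factor f (hypothetical, not from the question):

```r
# Hypothetical factor, only to show the pitfall
f <- as.factor(c("10", "0.5", "2"))

codes <- as.numeric(f)         # level codes: positions in levels(f)
vals  <- as.numeric(paste(f))  # the numbers the labels actually spell out

codes
vals
## [1] 10.0  0.5  2.0
```

as.character(f) would work in place of paste(f) here.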

Alternatively, this can be written as a pipeline:

library(magrittr)

x %>%
  paste %>%
  strsplit("&", fixed = TRUE) %>%
  unlist %>%
  Filter(function(x) x != ".", .) %>%
  as.numeric
## [1] 0.0119885482 2.2588059390 0.2951420836 0.7083233504 0.1937666799
## [6] 0.0007652399

2) The approach in the question can work if we afterwards remove the leading and trailing dots, drop the zero-length fields, and convert to numeric:

as.numeric(Filter(nzchar, trimws(gsub("[^0-9.E-]", "", x), whitespace = "\\.")))
## [1] 0.0119885482 2.2588059390 0.2951420836 0.7083233504 0.1937666799
## [6] 0.0007652399
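
To make the nested call in 2) easier to follow, here is the same chain unrolled step by step on a shortened sample. The names step1, step2, step3 and nums are mine. As far as I know, the whitespace argument of trimws needs R 3.6.0 or later.

```r
# Shortened sample of the question's data
x <- c(".&.", ".&2.25880593895", ".&.&.&.&7.65239874523E-4&.&.")

step1 <- gsub("[^0-9.E-]", "", x)           # keep only digits, ".", "E" and "-"
step2 <- trimws(step1, whitespace = "\\.")  # strip leading and trailing dots
step3 <- Filter(nzchar, step2)              # drop the now-empty strings
nums  <- as.numeric(step3)

step1
## [1] ".."                     ".2.25880593895"         "....7.65239874523E-4.."
step2
## [1] ""                 "2.25880593895"    "7.65239874523E-4"
```

One caveat of this approach: a value written with a leading dot, such as ".5", would lose that dot in the trimws step.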

Update

A comment mentioned that the result should be the same length as the input. Assuming that character output is wanted in that case, the above can be shortened to the following:

L <- strsplit(paste(x), "&", fixed = TRUE)
sapply(L, function(x) c(x[x != "."], "")[1])
## [1] ""                 "0.0119885482338"  "2.25880593895"    ""                
## [5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"

x %>% paste %>% strsplit("&", fixed = TRUE) %>% sapply(function(x) c(x[x != "."], "")[1])
## [1] ""                 "0.0119885482338"  "2.25880593895"    ""                
## [5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"


trimws(gsub("[^0-9.E-]","",x), whitespace = "\\.")
## [1] ""                 "0.0119885482338"  "2.25880593895"    ""                
## [5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"
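
If a numeric result of the same length is wanted instead, one option (a sketch, not part of the original update) is to call as.numeric on the character result; the empty strings then become NA. The vector chr below is a shortened, hand-copied stand-in for that result.

```r
# Stand-in for the same-length character result shown above
chr <- c("", "0.0119885482338", "7.65239874523E-4")

num <- as.numeric(chr)   # "" is converted to NA
is.na(num)
## [1]  TRUE FALSE FALSE
```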

In case . and & are always together (as they are in your example), you can use \\.*&\\.* :

gsub("\\.*&\\.*", "", x)
#[1] ""                 "0.0119885482338"  "2.25880593895"    ""                
#[5] "0.295142083575"   "0.708323350364"   "0.193766679861"   "7.65239874523E-4"
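
The "always together" condition is doing real work here. A short sketch on made-up strings (hypothetical, not from the question) shows what happens when it does not hold:

```r
# Hypothetical inputs that probe the assumption
y <- c(".&3.5&.", ".", "0.1&.5")

res <- gsub("\\.*&\\.*", "", y)
res
## [1] "3.5"  "."    "0.15"
```

A lone "." with no & next to it is left untouched, and in "0.1&.5" the dot after & is swallowed, so the two fields are fused into "0.15". For data like that, splitting on & first is the safer route.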
