Extract part of a word into a field from a long string using R
I have a single long string variable with 3 observations. I am trying to create a field, prob, that extracts a specific substring from each long string. The code and the warning message are below.
data aa: "The probability of being a carrier is 0.0002422359 " " an BRCA1 carrier 0.0001061067 " " an BRCA2 carrier 0.00013612 "
aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))
Warning message: In aa$prob <- ifelse(grepl("The probability of being a carrier is", : Coercing LHS to a list
Here is my previous answer, updated to reflect a data.frame.
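library(dplyr)
# Sample data: the three strings from the question plus filler rows that should yield NA.
aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))
# Rows that mention a probability or a BRCA1/BRCA2 carrier get the trailing number
# (regex group 1) converted to numeric; every other row gets NA.
aa %>%
  mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa),
                                   gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))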
# aa prob
# 1 ... NA
# 2 ... NA
# 3 The probability of being a carrier is 0.0002422359 0.0002422359
# 4 an BRCA1 carrier 0.0001061067 0.0001061067
# 5 an BRCA2 carrier 0.00013612 0.0001361200
# 6 ... NA
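As for the warning in the question: using $<- on a plain character vector makes R coerce it to a list, which is why the data is wrapped in a data.frame above. Here is a minimal sketch, assuming the original aa was a bare character vector (the question does not show how it was created). The names vec, msg, and df are made up for illustration.

```r
# Assumption: aa in the question was a plain character vector, not a data.frame.
vec <- c("The probability of being a carrier is 0.0002422359 ",
         " an BRCA1 carrier 0.0001061067 ")

# Assigning with $ to an atomic vector makes R warn (it would coerce vec to a list).
# tryCatch captures the warning text instead of letting the assignment finish.
msg <- tryCatch({ vec$prob <- NA; "no warning" },
                warning = function(w) conditionMessage(w))
print(msg)

# Wrapping the strings in a data.frame first makes $<- add a column instead.
df <- data.frame(aa = vec, stringsAsFactors = FALSE)
df$prob <- NA_real_
str(df)
```

With a data.frame in place, both the dplyr pipeline above and a plain df$prob assignment work without the warning.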
Regex walk-through:
^ and $ are the beginning and end of the string, respectively; \\b is a word boundary; none of these "consume" any characters, they just mark beginnings and endings.
. means any one character.
? means "zero or one", aka optional; * means "zero or more"; + means "one or more"; all refer to the previous character/class/group.
\\s is blank space, including spaces and tabs.
[0-9] is a class, meaning any character between 0 and 9; similarly, [a-z] is all lowercase letters, [a-zA-Z] is all letters, [0-9A-F] is the hexadecimal digits, etc.
(...) is a saved group; it's not uncommon in a group to use | as an "or"; saved groups are used in the replacement= part of gsub as numbered groups, so \\1 recalls the first group from the pattern.
So grouped and summarized:
"^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
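1       ^^^^^^^^^^^^^^^^^^
2    ^^^
3 ^^^
4                         ^^^^
1. The number itself: one or more digits, an optional decimal point, then zero or more digits; saved as group 1.
2. A word boundary, so the number starts at the beginning of a "word"; it's possible for "12.345" to be parsed as "2.345" without this.
3. A lazy "anything" that skips the text before the number.
4. Any trailing blank space before the end of the string.
Grouped logically, in an organized way.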
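The pieces of the pattern can be checked on the three strings from the question using base R only. This is a sketch rather than part of the original answer: the names x, pattern, prob, greedy, and lazy are made up here, and sub is used in place of gsub because the anchored pattern can match at most once per string. The second half uses perl = TRUE purely so that the lazy-versus-greedy comparison follows well-known backtracking rules.

```r
x <- c("The probability of being a carrier is 0.0002422359 ",
       " an BRCA1 carrier 0.0001061067 ",
       " an BRCA2 carrier 0.00013612 ")

pattern <- "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"

# "\\1" keeps only the saved group, i.e. the trailing number.
prob <- as.numeric(sub(pattern, "\\1", x))
print(prob)

# Lazy vs. greedy prefix on a made-up string: a greedy .* swallows as much as it
# can, so the group is left with only the digits after the decimal point.
greedy <- sub("^.*\\b([0-9]+\\.?[0-9]*)\\s*$",  "\\1", "value 12.345", perl = TRUE)
lazy   <- sub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", "value 12.345", perl = TRUE)
print(c(greedy = greedy, lazy = lazy))
```

The comparison shows why the prefix is written as .*? rather than .*: the lazy form stops at the first place where the rest of the pattern can match, which keeps the whole number in the group.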
Regex isn't unique to R; it's a parsing language that R (and most other programming languages) supports in one way or another.