
Extract part of a word into a field from a long string using R

I have a single long string variable with 3 observations. I was trying to create a field, prob, that extracts a specific string from the long string. The code and the resulting message are below.

data aa:
"The probability of being a carrier is 0.0002422359 "
" an BRCA1 carrier 0.0001061067 "
" an BRCA2 carrier 0.00013612 "

aa$prob <- ifelse(grepl("The probability of being a carrier is", xx)==TRUE, word(aa, 8, 8), ifelse(grepl("BRCA", xx)==TRUE, word(aa, 5, 5), NA))

Warning message:
In aa$prob <- ifelse(grepl("The probability of being a carrier is",  :
  Coercing LHS to a list

Here is my previous answer, updated to reflect a data.frame.

library(dplyr)

aa <- data.frame(aa = c("...", "...", "The probability of being a carrier is 0.0002422359 ", " an BRCA1 carrier 0.0001061067 ", " an BRCA2 carrier 0.00013612 ", "..."))

aa %>%
  mutate(prob = as.numeric(if_else(grepl("(probability|BRCA[12] carrier)", aa), 
                                   gsub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa), NA_character_)))
#                                                    aa         prob
# 1                                                 ...           NA
# 2                                                 ...           NA
# 3 The probability of being a carrier is 0.0002422359  0.0002422359
# 4                      an BRCA1 carrier 0.0001061067  0.0001061067
# 5                        an BRCA2 carrier 0.00013612  0.0001361200
# 6                                                 ...           NA
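For readers who prefer not to load dplyr, the same idea can be sketched in base R. The first part reproduces the warning from the question: using `$<-` on a plain character vector coerces it to a list, which is presumably why the answer wraps the strings in a data.frame. The names `aa_vec`, `broken`, `warn_msg`, and `keep` are illustrative and not from the original post, and `perl = TRUE` is my own choice so that the lazy `.*?` follows PCRE semantics.

```r
# Hypothetical stand-in for the question's data: a bare character vector.
aa_vec <- c("The probability of being a carrier is 0.0002422359 ",
            " an BRCA1 carrier 0.0001061067 ",
            " an BRCA2 carrier 0.00013612 ",
            "...")

# `$<-` on an atomic vector warns and turns the vector into a list.
broken <- aa_vec
warn_msg <- NULL
withCallingHandlers(
  broken$prob <- NA,
  warning = function(w) {
    warn_msg <<- conditionMessage(w)   # keep the message for inspection
    invokeRestart("muffleWarning")     # continue without printing it
  }
)

# With a data.frame, `$<-` adds a column as intended.
aa <- data.frame(aa = aa_vec, stringsAsFactors = FALSE)

# Only rows that mention a probability or a BRCA carrier are parsed.
keep <- grepl("(probability|BRCA[12] carrier)", aa$aa)
aa$prob <- NA_real_
aa$prob[keep] <- as.numeric(
  sub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", aa$aa[keep], perl = TRUE)
)
```

Filling only the `keep` rows means `as.numeric()` never sees a string without a number, so no coercion warning is raised for rows like "...".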

Regex walk-through:

  • ^ and $ are the beginning and end of the string, respectively; \\b is a word boundary; none of these "consume" any characters, they just mark beginnings and endings
  • . means any one character
  • ? means "zero or one", aka optional; * means "zero or more"; + means "one or more"; all refer to the previous character/class/group (placed directly after another quantifier, as in .*?, the ? instead makes that quantifier lazy, so it matches as little as possible)
  • \\s is whitespace, including spaces and tabs
  • [0-9] is a class, meaning any character between 0 and 9; similarly, [a-z] is all lowercase letters, [a-zA-Z] is all letters, [0-9A-F] is the hexadecimal digits, etc
  • (...) is a saved group; it's not uncommon in a group to use | as an "or"; groups are numbered and can be recalled later in the replacement= part of gsub, so \\1 recalls the first group from the pattern
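The building blocks above can be tried one at a time. This is a small sketch with made-up inputs (the strings "BRCA3 carrier", "colour", and the "patient" alternative are illustrations, not data from the question), using only base R's `grepl()` and `sub()`.

```r
x <- c("BRCA1 carrier", "BRCA2 carrier", "BRCA3 carrier")

# ^ anchors the match at the start; [12] is a class matching "1" or "2".
starts_brca12 <- grepl("^BRCA[12]", x)

# ? makes the preceding character optional.
optional_u <- grepl("^colou?r$", c("color", "colour", "colouur"))

# \\s* allows zero or more whitespace characters; "\t" is a tab.
only_blank <- grepl("^\\s*$", c("  ", " \t", " a "))

# (...) saves groups; | is "or"; \\1 and \\2 recall groups in the replacement.
swapped <- sub("^(BRCA[12]) (carrier|patient)$", "\\2 of \\1", x[1])
```

Here `starts_brca12`, `optional_u`, and `only_blank` each come out as TRUE, TRUE, FALSE, and `swapped` is "carrier of BRCA1".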

So grouped and summarized:

  "^.*?\\b([0-9]+\\.?[0-9]*)\\s*$"
1         ^^^^^^^^^^^^^^^^^^
2      ^^^
3   ^^^
4                           ^^^^
  1. This is the "number" part, which allows for one or more digits, an optional decimal point, and zero or more digits. This is saved in group "1".
  2. The word boundary guarantees that we include leading digits (it's possible, depending on a few things, for "12.345" to be parsed as "2.345" without this).
  3. Anything before the number-like string.
  4. Some or no blank space after the number.
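To see why the prefix part matters, here is a hedged comparison on a made-up input, "risk 12.345 ". It uses `perl = TRUE` (my assumption, so the behaviour is that of a backtracking PCRE engine; R's default engine may differ in how it treats `.*?`). The greedy variant without `\\b` is not from the answer and is included only for contrast.

```r
x <- "risk 12.345 "

# Pattern from the answer: lazy prefix, word boundary, number, trailing blanks.
lazy <- sub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", x, perl = TRUE)

# Contrast (not from the answer): greedy prefix and no word boundary.
# The greedy .* swallows as much as it can, leaving only the last digit.
greedy <- sub("^.*([0-9]+\\.?[0-9]*)\\s*$", "\\1", x, perl = TRUE)

# No number at all: the pattern does not match, so sub() returns the input.
unchanged <- sub("^.*?\\b([0-9]+\\.?[0-9]*)\\s*$", "\\1", "...", perl = TRUE)
```

Under these assumptions `lazy` is "12.345" while `greedy` keeps only "5". The `unchanged` case shows why the answer guards the extraction with `grepl()` and falls back to NA for rows without a number.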

Grouped logically, in an organized way.

Regex isn't unique to R; it's a parsing language that R (and most other programming languages) supports in one way or another.
