
Extract a string or value based on a specific word before and a % sign after in R

I have a Text column with thousands of rows of paragraphs, and I want to extract the values of "Capacity > x%". The operator can be >, <, =, ~, and so on. I basically need the operator and the integer value (e.g. <40%) placed in a new column next to the text, on the same row. I have tried removing the text before and after the value, gsub, grep, grepl, str_extract, etc., none with good results. I am not sure whether the percent sign is throwing it off or I am just not getting the code structure right. I would appreciate your assistance. Here is some of the code I have tried (aa is the data frame, TEXT is the column name):

str_extract(string =aa$TEXT, pattern = perl("(?<=LVEF).*(?=%)"))

gsub(".*[Capacity]([^.]+)[%].*", "\\1", aa$TEXT)

genXtract(aa$TEXT, "Capacity", "%")

gsub("%.*$", "%", aa$TEXT)

grep("^Capacity.*%$",aa$TEXT)

Since you did not provide a reproducible example, I created one myself and used it here.

We can use sub to extract everything after "Capacity" up to and including a number followed by a % sign.

sub(".*Capacity(.*\\d+%).*", "\\1", aa$TEXT)
#[1] " > 10%"  " < 40%"  " ~ 230%"

Or with str_extract

stringr::str_extract(aa$TEXT, "(?<=Capacity).*\\d+%")
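As a minimal sketch of how the extracted value could be placed in a column next to the text, the snippet below reuses the sample data from this answer and the sub call shown above. Stripping the whitespace and splitting the result into an operator and a number are extra steps assumed here for illustration, and the column names capacity, operator and value are hypothetical.

```r
# Sample data copied from this answer
aa <- data.frame(TEXT = c("This is a temp text, Capacity > 10%",
                          "This is a temp text, Capacity < 40%",
                          "Capacity ~ 230% more text  ahead"),
                 stringsAsFactors = FALSE)

# Extract everything between "Capacity" and the number followed by %
raw <- sub(".*Capacity(.*\\d+%).*", "\\1", aa$TEXT)

# Drop all whitespace so " > 10%" becomes ">10%", then store it next to TEXT
aa$capacity <- gsub("\\s+", "", raw)

# Optionally split into operator and integer value (illustrative column names)
aa$operator <- sub("\\d+%$", "", aa$capacity)
aa$value <- as.integer(sub("^\\D*(\\d+)%$", "\\1", aa$capacity))
```

Note that sub leaves a row unchanged when the pattern does not match, so this sketch assumes every row contains a "Capacity ... %" value.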

data

aa <- data.frame(TEXT = c("This is a temp text, Capacity > 10%", 
                    "This is a temp text, Capacity < 40%", 
                    "Capacity ~ 230% more text  ahead"), stringsAsFactors = FALSE)

gsub solution

I think your gsub solution was pretty close, but it didn't bring along the percentage sign because the % is outside the capture group (the parentheses). Also note that [Capacity] is a character class that matches a single character rather than the whole word, so the literal word is used below. Something like this should work (the result is assigned to the capacity column):

aa$capacity <- gsub(".*Capacity([^.]+%).*", "\\1", aa$TEXT)

Alternative method

The gsub approach returns the whole string unchanged when there is no match. To avoid this, we can use the stringr package with a more specific regular expression:

library(magrittr)
library(dplyr)
library(stringr)

aa %>% 
  mutate(capacity = str_extract(TEXT, "(?<=Capacity\\s)\\W\\s?\\d+\\s?%")) %>%
  mutate(capacity = str_squish(capacity)) # Remove excess white space

This code will give NA when there is no match, which I believe is your desired behaviour.
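For comparison, here is a rough base R sketch of the same NA-on-no-match behaviour without dplyr or stringr. It reuses the regular expression from the answer with perl = TRUE so the lookbehind is supported. The third input string is a hypothetical row added only to show the no-match case.

```r
texts <- c("This is a temp text, Capacity > 10%",
           "Capacity ~ 230% more text  ahead",
           "No figure reported here")   # hypothetical row without a match

pattern <- "(?<=Capacity\\s)\\W\\s?\\d+\\s?%"

# regexpr() returns -1 for elements where the pattern is not found
m <- regexpr(pattern, texts, perl = TRUE)

# Start with NA everywhere, then fill in only the rows that matched
capacity <- rep(NA_character_, length(texts))
capacity[m != -1] <- regmatches(texts, m)

# By contrast, sub() hands back the input untouched when nothing matches
unchanged <- sub(".*Capacity(.*\\d+%).*", "\\1", texts[3])
```

Here capacity holds NA for the third row, while unchanged still contains the full original sentence, which is the difference the answer describes.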


 