R regex to extract a standalone character

Question

I'm trying to extract the letter R or O from multiple columns, but only when it stands alone. By standalone, I mean an R or O that is (i) separated from other characters by spaces or (ii) the only value in a cell. Here's a reproducible example. Suppose I want to extract a standalone R or O from columns X1 and X2.

df <- data.frame(X1 = c( "EHO", "X 1 R","R"),
                 X2 = c( "Y R E", "X A 1", "AER"), 
                 X3 = NA)

Here's the desired outcome.

data.frame(X1 = c("", "R", "R"),
           X2 = c("R", "", ""))

Here's what I've tried so far. The first approach is problematic because the R from "AER" and the O from "EHO" are extracted (and the R from "Y R E" is not extracted).

require(stringr)
sapply(df[,1:2], function(x) ifelse( df$X3 %in% NA, str_extract(x, "\\s?[O|R]$"), X3))

So I tried this, which solves the above problem, but now it fails to extract R from df[3,1].

sapply(df[,1:2], function(x) ifelse( df$X3 %in% NA, str_extract(x, "(?![A-Z]+?)\\s?[O|R]$?"), X3))

How should I fix the pattern to get the desired outcome?

Answer 1

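You can use word boundaries (\\b), which match only where a letter, digit, or underscore meets a different kind of character or the start or end of the string: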

sapply(df, stringr::str_extract, '\\b[RO]\\b')

#     X1  X2  X3
#[1,] NA  "R" NA
#[2,] "R" NA  NA
#[3,] "R" NA  NA
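The result above uses NA where nothing matched and includes X3, whereas the question asks for empty strings and only X1 and X2. A minimal sketch of one way to get there, assuming stringr is installed (the name `res` is introduced here purely for illustration):

```r
library(stringr)

df <- data.frame(X1 = c("EHO", "X 1 R", "R"),
                 X2 = c("Y R E", "X A 1", "AER"),
                 X3 = NA,
                 stringsAsFactors = FALSE)

# Extract the first standalone R or O from X1 and X2 only.
# \\b requires a word boundary on both sides, so the R in "AER"
# and the O in "EHO" are not matched.
res <- lapply(df[c("X1", "X2")], str_extract, "\\b[RO]\\b")

# Rebuild a data frame and replace NA (no match) with "".
res <- as.data.frame(res, stringsAsFactors = FALSE)
res[is.na(res)] <- ""
res
```

Using lapply rather than sapply keeps the result as a list of columns, which converts back to a data frame without going through a matrix.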

However, note that str_extract returns only the first match, so if a string contains both a standalone "R" and a standalone "O", you get whichever comes first.

stringr::str_extract('EH R O', '\\b[RO]\\b')
#[1] "R"

If you want to extract both of them, you might need to use str_extract_all.
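As a rough sketch of that approach, assuming stringr is installed: str_extract_all returns a list with one character vector per input string, so the matches usually need collapsing before they fit back into a column. The names `x`, `all_matches`, and `collapsed`, and the choice of a space as separator, are illustrative only.

```r
library(stringr)

x <- c("EH R O", "AER", "R")

# One list element per input string, each holding all standalone R/O matches.
all_matches <- str_extract_all(x, "\\b[RO]\\b")

# Collapse each element into a single space-separated string;
# strings with no match become "".
collapsed <- vapply(all_matches, paste, character(1), collapse = " ")
collapsed
#[1] "R O" ""    "R"
```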
