R - regex match pattern and store only the pattern in a new column?

I've been searching for hours. This should be very easy, but I don't see how :(

I have a data frame called ds that contains a column structured like this:

name
"Doe, Mr. John"
"Worth, Miss. Jane"

I want to extract the middle word and put it into a new column.

#This is how I'm doing it now
ds$title <- NA

mr  <- grep(", Mr. ", ds$name)
miss <- grep(", Miss. ", ds$name)

ds$title[mr] <- ", Mr. "
ds$title[miss] <- ", Miss. "

I'm trying to generalize this with regex so that it takes any middle word matching the pattern "comma, space, word, period, space".

This is my best guess, but it only removes the pattern:

gsub(", .+\\.+ ", "", ds$name)

How do I keep the pattern and remove the rest?

Thank you!

You can use a capture group. Basically, you match the whole string, use a capture group to match the part you want to keep, and replace the whole match with the capture group:

# I often specify perl = TRUE, though it isn't necessary here
(ds$title <- gsub(".+(, .+\\.+ ).+", "\\1", ds$name, perl = TRUE))
#[1] ", Mr. "   ", Miss. "

The capture group is what's in the parentheses, (, .+\\.+ ), and you refer back to it with \\1. If you had a second capture group, you'd refer to it as \\2.

Note that if you want to catch comma, space, word, period, space, then you could modify the capture group to (, .+\\. ). You only need to match one period, not one or more.
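If only the title itself is wanted, without the surrounding comma and spaces, the parentheses can be moved so that they wrap just the word and its period. The following is a minimal sketch under that assumption, not part of the original answer; it uses sub and the character class [^ ]+ (one or more non-space characters) so the captured title cannot span several words:

```r
# Sample data from the question
ds <- data.frame(name = c("Doe, Mr. John", "Worth, Miss. Jane"),
                 stringsAsFactors = FALSE)

# Match the whole string, capture only "word + period" after ", ",
# then replace the entire match with the captured group (\\1)
ds$title <- sub("^.*, ([^ ]+\\.) .*$", "\\1", ds$name)

ds$title
#[1] "Mr."   "Miss."
```

Keep in mind that sub and gsub return the input unchanged when the pattern does not match, so a name without a title would be copied into the new column as-is.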


A straightforward stringi alternative that does not use capture groups is stri_extract_first_regex. In this case stri_extract_last_regex or stri_extract_all_regex would find the same match, though stri_extract_all_regex returns a list rather than a character vector.

library(stringi)
ds$title <- stri_extract_first_regex(ds$name, ", .+\\. ")
#[1] ", Mr. "   ", Miss. "

And, as thelatemail pointed out in a comment, you can do a similar thing with base R too, but it's a little harder to remember how to use the regmatches and regexpr functions:

regmatches(ds$name, regexpr(", .+\\. ", ds$name))
#[1] ", Mr. "   ", Miss. "
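One thing to watch with the regmatches/regexpr approach when the result goes into a new column: regmatches returns only the elements that matched, so the result is shorter than the input whenever a name has no title. A minimal sketch of one way to handle that, where the third row ("Cher") is a hypothetical name added only to show the non-matching case:

```r
# "Cher" is a made-up row with no title, added for illustration
ds <- data.frame(name = c("Doe, Mr. John", "Worth, Miss. Jane", "Cher"),
                 stringsAsFactors = FALSE)

# regexpr returns -1 for elements where the pattern does not match
m <- regexpr(", .+\\. ", ds$name)
matched <- m != -1

# Start with NA everywhere, then fill in only the rows that matched
ds$title <- NA_character_
ds$title[matched] <- regmatches(ds$name, m)

ds$title
#[1] ", Mr. "   ", Miss. " NA
```

By contrast, stri_extract_first_regex already returns NA for non-matching elements, so it can be assigned to the column directly.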

Matched capture groups are your BFF:

library(stringi)
library(purrr)

ds <- data.frame(name=c("Doe, Mr. John", "Worth, Miss. Jane"), stringsAsFactors=FALSE)

nonsp <- "[[:alnum:][:punct:]]+"
sp <- "[[:blank:]]+"

stri_match_all_regex(ds$name, nonsp %s+% sp %s+% "(" %s+% nonsp %s+% ")" %s+% sp %s+% nonsp) %>%
  map_chr(2)
## [1] "Mr."   "Miss."

For your "add column to a data frame" needs:

library(stringi)
library(dplyr)
library(purrr)

ds <- data.frame(name=c("Doe, Mr. John", "Worth, Miss. Jane"), stringsAsFactors=FALSE)

nonsp <- "[[:alnum:][:punct:]]+"
sp <- "[[:blank:]]+"

mutate(ds, title=stri_match_all_regex(ds$name, nonsp %s+% sp %s+% "(" %s+% nonsp %s+% ")" %s+% sp %s+% nonsp) %>% map_chr(2))
##                name title
## 1     Doe, Mr. John   Mr.
## 2 Worth, Miss. Jane Miss.
