R - Regex to match a pattern and store only the pattern in a new column?

Question

I've been searching for hours. This should be very easy, but I don't see how :(

I have a data frame called ds that contains a column structured like this:

name
"Doe, Mr. John"
"Worth, Miss. Jane"

I want to extract the middle word and put it into a new column.

#This is how I'm doing it now
ds$title <- NA

mr  <- grep(", Mr. ", ds$name)
miss <- grep(", Miss. ", ds$name)

ds$title[mr] <- ", Mr. "
ds$title[miss] <- ", Miss. "

I'm trying to generalize this with regex so that it takes any middle word matching the pattern "comma, space, word, period, space".

This is my best guess, but it only removes the pattern:

gsub(", .+\\.+ ", "", ds$name)

How do I keep the pattern and remove the rest?

Thank you!

Answer 1

You can use a capture group. Basically, you match the whole pattern, use a capture group to match the part you want to keep, and replace the whole match with the capture group:

# I often specify perl = TRUE, though it isn't necessary here
(ds$title <- gsub(".+(, .+\\.+ ).+", "\\1", ds$name, perl = TRUE))
#[1] ", Mr. "   ", Miss. "

The capture group is the part of the pattern in parentheses, (, .+\\.+ ), and you refer back to it with \\1. If you had a second capture group, you'd refer to it as \\2.
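To illustrate references to more than one group, here is a minimal base R sketch. It assumes every name has the "Surname, Title. Given" shape from the question; the variable names name, pattern, title and swapped are made up for the example.

```r
# Sample names in the same shape as the question's data
name <- c("Doe, Mr. John", "Worth, Miss. Jane")

# Three capture groups: surname, title (without the period), given name
pattern <- "^(.+), (.+)\\. (.+)$"

title   <- sub(pattern, "\\2", name)      # keep only the second group
swapped <- sub(pattern, "\\3 \\1", name)  # third group, a space, then the first group

print(title)
print(swapped)
```

Note that sub() returns a string unchanged when the pattern does not match it, so rows in a different shape would come back as the full name rather than NA.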

Note that if you want to catch comma, space, word, period, space, then you could modify the capture group to (, .+\\. ). You only need to match one period, not one or more.
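One thing to watch with .+ is that it is greedy, so it can run past the first period if the name contains another ". " later on. The sketch below uses a made-up name with a middle initial (not from the question's data) and base R's regexpr() and regmatches() to compare the pattern above with a stricter variant that only allows letters in the word. The letters-only class is an assumption about what counts as a "word" here.

```r
# Hypothetical name with a middle initial, which adds a second ". " to the string
x <- "Doe, Mr. John Q. Smith"

# Greedy .+ extends to the last ". " it can still match
greedy <- regmatches(x, regexpr(", .+\\. ", x))

# Allow only letters between the comma-space and the period
strict <- regmatches(x, regexpr(", [[:alpha:]]+\\. ", x))

print(greedy)  # ", Mr. John Q. "
print(strict)  # ", Mr. "
```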

A straightforward stringi alternative that does not use capture groups is stri_extract_first_regex (in this case stri_extract_last_regex or stri_extract_all_regex would work fine too):

library(stringi)
ds$title <- stri_extract_first_regex(ds$name, ", .+\\. ")
#[1] ", Mr. "   ", Miss. "

As thelatemail pointed out in a comment, you can do a similar thing with base R, too, but it's a little harder to remember how to use the regmatches and regexpr functions:

regmatches(ds$name, regexpr(", .+\\. ", ds$name))
#[1] ", Mr. "   ", Miss. "
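A caveat with this base R form: regmatches() with regexpr() returns only the substrings that matched, so the result is shorter than the input whenever a name does not fit the pattern, and assigning it straight to a new column would then fail. A minimal sketch of one way around that, assuming NA is an acceptable value for non-matching rows (the third name and the variable names are made up for illustration):

```r
# Third element is a hypothetical name with no title
x <- c("Doe, Mr. John", "Worth, Miss. Jane", "Cher")

m <- regexpr(", .+\\. ", x)   # -1 marks elements with no match

# Start from NA and fill in only the positions that matched
title <- rep(NA_character_, length(x))
title[m != -1] <- regmatches(x, m)

print(title)
```

The result has the same length as x, so it can be assigned to ds$title directly.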

Answer 2

Matched capture groups are your BFF:

library(stringi)
library(purrr)

ds <- data.frame(name=c("Doe, Mr. John", "Worth, Miss. Jane"), stringsAsFactors=FALSE)

nonsp <- "[[:alnum:][:punct:]]+"
sp <- "[[:blank:]]+"

stri_match_all_regex(ds$name, nonsp %s+% sp %s+% "(" %s+% nonsp %s+% ")" %s+% sp %s+% nonsp) %>%
  map_chr(2)
## [1] "Mr."   "Miss."

For your "add column to a data frame" needs:

library(stringi)
library(dplyr)
library(purrr)

ds <- data.frame(name=c("Doe, Mr. John", "Worth, Miss. Jane"), stringsAsFactors=FALSE)

nonsp <- "[[:alnum:][:punct:]]+"
sp <- "[[:blank:]]+"

mutate(ds, title=stri_match_all_regex(ds$name, nonsp %s+% sp %s+% "(" %s+% nonsp %s+% ")" %s+% sp %s+% nonsp) %>% map_chr(2))
##                name title
## 1     Doe, Mr. John   Mr.
## 2 Worth, Miss. Jane Miss.


Answer 1
Score 3, accepted, 2017-01-10 01:05:12

Answer 2
Score 1, 2017-01-10 01:06:05
