Regular expression to extract text between 2 commas in R

Question

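I have a bunch of text in a data frame (df) that usually contains three lines of an address in 1 column, and my goal is to extract the district (the middle part of the text), e.g.: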

73 Greenhill Gardens, Wandsworth, London
22 Acacia Heights, Lambeth, London

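Fortunately, in 95% of cases the person entering the data used commas to separate the text I want, and 100% of the time it ends with ", London" (i.e. comma, space, London). To state things clearly, my goal is to extract the text before ", London" and after the preceding comma.

My desired output is: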

Wandsworth
Lambeth

I have managed to extract the part before:

df$extraction <- sub('.*,\\s*','',address)

and after:

df$extraction <- sub('.*,\\s*','',address)

but not the middle part that I need. Can anyone help?

Many thanks!

Answer 1

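You can save yourself the regex headache and treat the vector as CSV, using a file-reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns.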

address <- c(
    "73 Greenhill Gardens, Wandsworth, London", 
    "22 Acacia Heights, Lambeth, London"
)

read.csv(text = address, colClasses = c("NULL", "character", "NULL"), 
    header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"

Or we can use fread(). Its select argument is handy, and it strips white space automatically.

data.table::fread(paste(address, collapse = "\n"), 
    select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth"
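
The extracted vector can be written straight back into the data frame. A minimal sketch with read.csv(), assuming the addresses sit in a column named address of a data frame df (the names come from the question's code; the sample rows are the two from the question) and that every row has exactly three comma-separated fields:

```r
# Sample data: the two addresses from the question
df <- data.frame(
  address = c(
    "73 Greenhill Gardens, Wandsworth, London",
    "22 Acacia Heights, Lambeth, London"
  ),
  stringsAsFactors = FALSE
)

# Read the column as CSV text, keep only the second field,
# and store it in a new column
df$extraction <- read.csv(
  text = df$address,
  header = FALSE,
  strip.white = TRUE,
  colClasses = c("NULL", "character", "NULL")
)[[1L]]

df$extraction
# [1] "Wandsworth" "Lambeth"
```

Rows that do not split into exactly three fields would need to be handled separately before this step.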

Answer 2

Here are a couple of approaches:

# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth"

Or

# target the whole string, but use a capture group 
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth"
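
Note that sub() returns its input unchanged when the pattern does not match, which matters for the rows the question says lack the separating comma. A hedged sketch that returns NA for such rows instead; the third address is a made-up example of a row with only one comma:

```r
address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London",
  "5 High Street Camden, London"  # hypothetical row without the middle comma
)

pattern <- ".+, (.*), London"

# Keep the captured group only where the pattern matches; otherwise NA
district <- ifelse(
  grepl(pattern, address),
  sub(pattern, "\\1", address),
  NA_character_
)

district
# [1] "Wandsworth" "Lambeth"    NA
```

Flagging the non-matching rows as NA makes it easier to find and clean them by hand afterwards.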

Answer 3

You can try this

(?<=, )(.+?),

It works for any dataset; the location does not have to be London.
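
The answer gives only the pattern, so here is one possible way to apply it in R. Lookbehind needs perl = TRUE, and the full match includes the trailing comma, so this sketch reads the capture group through regexec() and regmatches(). It assumes a version of R in which regexec() accepts perl = TRUE, and reuses the address vector from the other answers:

```r
address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London"
)

# Element 1 of each result is the full match (with the trailing comma),
# element 2 is the text captured by (.+?)
m <- regmatches(address, regexec("(?<=, )(.+?),", address, perl = TRUE))

# Pick the capture group; rows without a match yield NA
district <- vapply(m, function(x) x[2], character(1))

district
# [1] "Wandsworth" "Lambeth"
```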

Answer 4

The following two options do not depend on the city name being the same. The first uses a regular expression pattern with stringr::str_extract():

raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London", 
  "22 Acacia Heights, Lambeth, London",
  "Street, District, City"
)

df <- data.frame(raw_address, stringsAsFactors = FALSE)

df$distict = stringr::str_extract(raw_address, '(?<=,)[^,]+(?=,)')

> df
                               raw_address     distict
1 73 Greenhill Gardens, Wandsworth, London  Wandsworth
2       22 Acacia Heights, Lambeth, London     Lambeth
3                   Street, District, City    District
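
One detail worth checking: the lookbehind (?<=,) does not consume the space that follows the comma, so the matched text starts with that space. A small sketch in base R, using regmatches() and regexpr() with perl = TRUE as a stand-in for stringr::str_extract() (an assumption made so the example needs no extra package), with trimws() to drop the space:

```r
raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London",
  "Street, District, City"
)

pattern <- "(?<=,)[^,]+(?=,)"

# Text between the first two commas, leading space included
matched <- regmatches(raw_address, regexpr(pattern, raw_address, perl = TRUE))

# Remove the surrounding whitespace
district <- trimws(matched)

district
# [1] "Wandsworth" "Lambeth"    "District"
```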

The second uses strsplit() and makes the other elements of the address easier to get at:

df$address <- sapply(strsplit(raw_address, ',\\s*'), `[`, 1) 
df$distict <- sapply(strsplit(raw_address, ',\\s*'), `[`, 2)
df$city <- sapply(strsplit(raw_address, ',\\s*'), `[`, 3)

> df
                               raw_address              address    distict   city
1 73 Greenhill Gardens, Wandsworth, London 73 Greenhill Gardens Wandsworth London
2       22 Acacia Heights, Lambeth, London    22 Acacia Heights    Lambeth London
3                   Street, District, City               Street   District   City

The split is done on ,\\s* in case there is no space, or more than one space, after a comma.
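
As a variation, the string can be split once and the pieces bound into a matrix. This is a sketch that assumes every address has exactly three comma-separated parts; the column name district is chosen here for illustration:

```r
raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London",
  "Street, District, City"
)

# Split each address on a comma plus optional whitespace,
# then stack the pieces into a matrix with one row per address
parts <- do.call(rbind, strsplit(raw_address, ",\\s*"))

df <- data.frame(
  raw_address,
  address  = parts[, 1],
  district = parts[, 2],
  city     = parts[, 3],
  stringsAsFactors = FALSE
)

df$district
# [1] "Wandsworth" "Lambeth"    "District"
```

Splitting once avoids calling strsplit() three times on the same vector.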

4 answers

Answer 1: score 8, 2016-01-25 02:08:08
Answer 2: score 5, accepted, 2016-01-25 02:11:44
Answer 3: score 0, 2016-01-25 02:26:13
Answer 4: score 0, 2018-12-21 11:22:31
