(Extract/Separate/Match) Groups in Any Order
# Sample data frame
df <- data.frame(Column_A = c("1011 Red Cat",
                              "Mouse 2011 is in the House 3001",
                              "Yellow on Blue Dog walked around Park"))
I have a column of manually entered data that I'm trying to clean.
  Column_A
1|1011 Red Cat                         |
2|Mouse 2011 is in the House 3001      |
3|Yellow on Blue Dog walked around Park|
I want to separate each characteristic into its own column, but still keep Column_A so that I can pull out other characteristics later.
  Colour         |Code      |Column_A
1|Red            |1011      |Cat
2|NA             |2011 3001 |Mouse is in the House
3|Yellow on Blue |NA        |Dog walked around Park
To date, I've been re-ordering them with gsub and capture groups, then using tidyr::extract to separate them.
library(dplyr)
library(tidyr)
library(stringr)

df1 <- df %>%
  # Move the first colour to the front of the string
  mutate(Column_A = gsub("(.*?)?(Yellow|Blue|Red)(.*)?", "\\2 \\1\\3",
                         Column_A, perl = TRUE)) %>%
  # Remove surplus whitespace
  mutate(Column_A = str_squish(Column_A)) %>%
  # Extract the leading colour
  extract(Column_A, c("Colour", "Column_A"), "(Red|Yellow|Blue)?(.*)") %>%
  # Repeat the preceding steps for the codes
  mutate(Column_A = gsub("(.*?)?(\\b\\d{1,}\\b)(.*)?", "\\2 \\1\\3",
                         Column_A, perl = TRUE)) %>%
  mutate(Column_A = str_squish(Column_A)) %>%
  extract(Column_A, c("Code", "Column_A"), "(\\b\\d{1,}\\b)?(.*)") %>%
  mutate(Column_A = str_squish(Column_A))
Which results in this:
 Colour |Code |Column_A
|Red    |1011 |Cat
|NA     |2011 |Mouse is in the House 3001
|Yellow |NA   |on Blue Dog walked around Park
This works fine for the first row, but not for the following rows, where the values are separated by spaces and other words; I've subsequently been extracting and uniting those. What's a more elegant way of doing this?
Here's a solution with a combination of stringr and gsub, using the list of colours supplied in R:
library(dplyr)
library(stringr)

# List of colour names from R's colors()
cols <- as.character(colors())

apply(df,
      1,
      function(x)
        tibble(
          # Extract a comma-separated list of colours
          Color = cols[cols %in% str_split(tolower(x), " ", simplify = TRUE)] %>%
            paste0(collapse = ","),
          # Extract a comma-separated list of digit sequences
          Code = str_extract_all(x, regex("\\d+"), simplify = TRUE) %>%
            paste0(collapse = ","),
          # Remove colours and digits from Column_A
          Column_A = gsub(paste0("(\\d+|",
                                 paste0(cols, collapse = "|"),
                                 ")"), "", x, ignore.case = TRUE) %>% trimws())) %>%
  bind_rows()
# A tibble: 3 x 3
  Color       Code      Column_A
  <chr>       <chr>     <chr>
1 red         1011      Cat
2 ""          2011,3001 Mouse is in the House
3 blue,yellow ""        on Dog walked around Park
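The colour lookup in this answer can be tried on a single string. The following is a minimal sketch, assuming base R only (strsplit instead of str_split); the names tokens and found are illustrative and not part of the original answer.

```r
# Colour names known to R (grDevices is attached by default)
cols <- colors()

# Lower-case the text and split it into space-separated tokens
tokens <- strsplit(tolower("Yellow on Blue Dog walked around Park"), " ")[[1]]

# Keep only the colour names that appear as whole tokens;
# the result follows the order of colors(), not the order in the text
found <- cols[cols %in% tokens]
print(paste0(found, collapse = ","))
```

Because the result follows the order of colors(), the colours come back as blue,yellow rather than in the order they were typed. Also note that the gsub step in the answer matches colour names anywhere in the string, including inside longer words, so a pattern wrapped in word boundaries may be safer on real data.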
Using tidyverse we can do
library(tidyverse)

colors <- paste0(c("Red", "Yellow", "Blue"), collapse = "|")

df %>%
  mutate(Color = str_extract(Column_A,
                             paste0("(", colors, ").*(", colors, ")|(", colors, ")")),
         Code = str_extract_all(Column_A, "\\d+"),
         Column_A = pmap_chr(list(Color, Code, Column_A), function(x, y, z)
           trimws(gsub(paste0("\\b", c(x, y), "\\b", collapse = "|"), "", z))),
         Code = map_chr(Code, paste, collapse = " "))
#                Column_A          Color      Code
#1                    Cat            Red      1011
#2  Mouse is in the House           <NA> 2011 3001
#3 Dog walked around Park Yellow on Blue
We first extract the text between two colors using str_extract. You can include in colors every colour that may occur in the data. We use paste0 to construct the regex. For this example it would be
paste0("(", colors, ").*(", colors, ")|(", colors, ")")
#[1] "(Red|Yellow|Blue).*(Red|Yellow|Blue)|(Red|Yellow|Blue)"
meaning: extract the text between (and including) two colors, or extract a single colour on its own.
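To see how the two alternatives of this pattern behave, here is a small sketch that applies it to the three sample strings. It assumes stringr is installed; x and color_match are illustrative names.

```r
library(stringr)

colors <- paste0(c("Red", "Yellow", "Blue"), collapse = "|")
pattern <- paste0("(", colors, ").*(", colors, ")|(", colors, ")")

x <- c("1011 Red Cat",
       "Mouse 2011 is in the House 3001",
       "Yellow on Blue Dog walked around Park")

# First alternative: a colour, anything, another colour (greedy).
# Second alternative: a single colour. No colour at all gives NA.
color_match <- str_extract(x, pattern)
print(color_match)
# [1] "Red"            NA               "Yellow on Blue"
```

Since .* is greedy, a string with three or more colours would be matched from the first colour through to the last one.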
For the Code part, as there can be multiple Code values, we use str_extract_all to get all the numbers from the column. This part is initially stored in a list.
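A short sketch of what that intermediate list looks like, assuming stringr is installed (x and codes are illustrative names):

```r
library(stringr)

x <- c("1011 Red Cat",
       "Mouse 2011 is in the House 3001",
       "Yellow on Blue Dog walked around Park")

# One list element per input string, each holding every run of digits found
codes <- str_extract_all(x, "\\d+")
str(codes)

# Number of codes found per row
print(lengths(codes))
# [1] 1 2 0
```

The third element is a zero-length character vector, which is why a later collapsing step produces an empty string rather than NA for that row.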
For the Column_A values, we use gsub to remove everything that was selected in Code and Color, wrapping each term in word boundaries, and keep the remaining part.
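The removal step can be sketched in base R for one row at a time. The variable names below are illustrative, and the final whitespace clean-up is an optional extra that is not in the original answer.

```r
# Row 1: one colour and one code
remove_pattern <- paste0("\\b", c("Red", "1011"), "\\b", collapse = "|")
print(remove_pattern)
# [1] "\\bRed\\b|\\b1011\\b"
remainder <- trimws(gsub(remove_pattern, "", "1011 Red Cat"))
print(remainder)
# [1] "Cat"

# Row 2: two codes; removing a term from the middle leaves a double space
pattern2 <- paste0("\\b", c("2011", "3001"), "\\b", collapse = "|")
remainder2 <- trimws(gsub(pattern2, "", "Mouse 2011 is in the House 3001"))

# Optional: collapse runs of whitespace (stringr::str_squish does the same)
squished <- gsub("\\s+", " ", remainder2)
print(squished)
# [1] "Mouse is in the House"

# Caveat: paste0 turns a missing Color into the literal text "NA"
na_pattern <- paste0("\\b", NA, "\\b")
```

The last line shows why a row without a colour still works in the answer's code: the pattern simply looks for the word NA. That would also delete a genuine "NA" token from the text, so it may be worth dropping missing values before building the pattern.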
As Code was previously stored as a list, we convert each element to a single string by collapsing it. This returns empty strings for rows with no match. You can convert them back to NA by adding Code = replace(Code, Code == "", NA) to the chain if needed.
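The collapsing and the conversion back to NA can be sketched in base R as follows. The answer uses map_chr for the collapsing; vapply is used here only to avoid the purrr dependency, and codes is an illustrative stand-in for the list column.

```r
# Stand-in for the list produced by str_extract_all
codes <- list("1011", c("2011", "3001"), character(0))

# Collapse each element to one string; an empty element becomes ""
Code <- vapply(codes, paste, character(1), collapse = " ")
print(Code)
# [1] "1011"      "2011 3001" ""

# Turn the empty strings back into missing values
Code <- replace(Code, Code == "", NA)
print(Code)
# [1] "1011"      "2011 3001" NA
```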