(Extract/Separate/Match) Groups in Any Order

# Sample Data Frame
df  <- data.frame(Column_A 
                  =c("1011 Red Cat", 
                     "Mouse 2011 is in the House 3001", "Yellow on Blue Dog walked around Park"))

I have a column of manually entered data that I'm trying to clean.

  Column_A 
1|1011 Red Cat                         |
2|Mouse 2011 is in the House 3001      |
3|Yellow on Blue Dog walked around Park|

I want to separate each characteristic into its own column, but still keep Column_A so that I can pull out other characteristics later.

  Colour               Code           Column_A
1|Red                 |1011          |Cat
2|NA                  |2011 3001     |Mouse is in the House
3|Yellow on Blue      |NA            |Dog walked around Park

To date, I've been re-ordering them with gsub and capturing groups, then using tidyr::extract to separate them.

library(dplyr)
library(tidyr)
library(stringr)

df1 <- df %>% 

  # Reorders the Colours
  mutate(Column_A = gsub("(.*?)?(Yellow|Blue|Red)(.*)?", "\\2 \\1\\3", 
                         Column_A, perl = TRUE)) %>%
  # Removes Whitespaces 
  mutate(Column_A =str_squish(Column_A)) %>%
  # Extracts the Colours 
  extract(Column_A, c("Colour", "Column_A"), "(Red|Yellow|Blue)?(.*)") %>%

  # Repeats the preceding steps for the codes
  mutate(Column_A = gsub("(.*?)?(\\b\\d{1,}\\b)(.*)?", "\\2 \\1\\3", 
                         Column_A, perl = TRUE)) %>%
  mutate(Column_A =str_squish(Column_A)) %>%
  extract(Column_A, c("Code", "Column_A"), "(\\b\\d{1,}\\b)?(.*)") %>%
  mutate(Column_A = str_squish(Column_A))

Which results in this:

Colour      Code    Column_A
|Red        |1011   |Cat
|NA         |2011   |Mouse is in the House 3001
|Yellow     |NA     |on Blue Dog walked around Park

This works fine for the first row, but not for the following rows, where the values are separated by spaces and other words; I've subsequently been extracting and uniting those. What's a more elegant way of doing this?

Here's a solution that combines stringr and gsub, using the list of colours supplied in R:

library(dplyr)
library(stringr)

# list of colours from R colors()
cols <- as.character(colors())

apply(df,
      1,
      function(x)

        tibble(
          # Extract a comma-separated list of colours
          Color = cols[cols %in% str_split(tolower(x), " ", simplify = TRUE)] %>%
            paste0(collapse = ","),

          # Extract a comma-separated list of digit runs
          Code = str_extract_all(x, regex("\\d+"), simplify = TRUE) %>%
            paste0(collapse = ","),

          # Remove colours and digits from Column_A
          Column_A = gsub(paste0("(\\d+|",
                                 paste0(cols, collapse = "|"),
                                 ")"), "", x, ignore.case = TRUE) %>% trimws())) %>%
  bind_rows()

# A tibble: 3 x 3
  Color       Code      Column_A                  
  <chr>       <chr>     <chr>                     
1 red         1011      Cat                       
2 ""          2011,3001 Mouse  is in the House    
3 blue,yellow ""        on  Dog walked around Park

Using tidyverse we can do:

library(tidyverse)

colors <- paste0(c("Red", "Yellow", "Blue"), collapse = "|")

df %>%
  mutate(Color = str_extract(Column_A,
                   paste0("(", colors, ").*(", colors, ")|(", colors, ")")),
         Code = str_extract_all(Column_A, "\\d+"),
         Column_A = pmap_chr(list(Color, Code, Column_A), function(x, y, z)
           trimws(gsub(paste0("\\b", c(x, y), "\\b", collapse = "|"), "", z))),
         Code = map_chr(Code, paste, collapse = " "))


#                 Column_A         Color      Code
#1                    Cat            Red      1011
#2 Mouse  is in the House           <NA> 2011 3001
#3 Dog walked around Park Yellow on Blue      

We first use str_extract to pull out the text between two colors. Put every color that can occur in the data into colors. The regex is built with paste0; for this example it would be

paste0("(", colors, ").*(", colors, ")|(", colors, ")")
#[1] "(Red|Yellow|Blue).*(Red|Yellow|Blue)|(Red|Yellow|Blue)"

meaning: extract the text between two colors (including the colors themselves), or else extract a single color.
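As a quick check of that pattern, the sketch below applies it directly to the three sample strings from the question. The names `x`, `pattern` and `color_match` are introduced here only for illustration.

```r
library(stringr)

# The three sample strings from the question
x <- c("1011 Red Cat",
       "Mouse 2011 is in the House 3001",
       "Yellow on Blue Dog walked around Park")

colors  <- paste0(c("Red", "Yellow", "Blue"), collapse = "|")
pattern <- paste0("(", colors, ").*(", colors, ")|(", colors, ")")

# First alternative: greedy span from the first colour to the last colour.
# Second alternative: a single colour on its own.
color_match <- str_extract(x, pattern)
color_match
# [1] "Red"            NA               "Yellow on Blue"
```

Because `.*` is greedy, a row with two colours far apart would return everything between them, so this approach assumes the colours sit close together as in "Yellow on Blue".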

For the Code part, since a row can have multiple codes, we use str_extract_all to get all the numbers from the column. At this point Code is stored as a list.
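To see why Code starts out as a list, here is a small sketch on the same three strings (the names `x` and `codes` are illustrative): each element holds zero, one or several matches.

```r
library(stringr)

x <- c("1011 Red Cat",
       "Mouse 2011 is in the House 3001",
       "Yellow on Blue Dog walked around Park")

# One list element per input string, each a character vector of digit runs
codes <- str_extract_all(x, "\\d+")

lengths(codes)   # number of codes found per row: 1 2 0
codes[[2]]       # "2011" "3001"
```

The third element is an empty character vector rather than NA, which is why the later collapse step produces an empty string for that row.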

For Column_A, we use gsub to remove everything that was selected in Code and Color, wrapping each term in word boundaries, and keep the remainder.
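The removal step can be walked through for a single row. This is a minimal base R sketch; `z`, `to_remove`, `cleaned` and `squished` are hypothetical names standing in for one row's Column_A and its extracted terms.

```r
# One row of Column_A and the terms extracted from it
z         <- "Mouse 2011 is in the House 3001"
to_remove <- c("2011", "3001")

# Wrap each term in word boundaries and join the terms with "|"
pattern <- paste0("\\b", to_remove, "\\b", collapse = "|")

# Delete the terms, then trim leading/trailing whitespace
cleaned <- trimws(gsub(pattern, "", z))
cleaned
# [1] "Mouse  is in the House"

# Optional extra step (not in the answer): collapse the leftover double space
squished <- gsub("\\s+", " ", cleaned)
squished
# [1] "Mouse is in the House"
```

One thing to watch: when Color is NA, paste0 turns it into the literal text "NA", so the pattern also contains `\\bNA\\b`. That is harmless for this data, but it would strip a standalone word "NA" from the text.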

Since Code was stored as a list, we then convert each element to a single string by collapsing it. This returns empty strings for rows with no match. If needed, you can convert them back to NA by adding Code = replace(Code, Code == "", NA) to the chain.
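The collapse-then-replace step looks like this in isolation. The sketch uses base R's vapply in place of the answer's map_chr so that it runs without purrr; `codes` and `code_chr` are illustrative names.

```r
# Codes as a list, in the shape str_extract_all returns for the sample data
codes <- list("1011", c("2011", "3001"), character(0))

# Collapse each element into one space-separated string;
# an empty element collapses to ""
code_chr <- vapply(codes, paste, character(1), collapse = " ")
code_chr
# [1] "1011"      "2011 3001" ""

# Turn the empty strings back into NA
code_chr <- replace(code_chr, code_chr == "", NA)
code_chr
# [1] "1011"      "2011 3001" NA
```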
