
Separating a column in R using Regex & separate (tidyr)

This is what I am looking to be able to do:
https://regex101.com/r/KchccA/1

I want to match any characters between = and ), while also handling the case where the captured group is empty, because I want every field to be populated in each row.

Example of a row: in this example Address4, County, and Contact name are null. You can also see that some fields have wrong or incorrect values. There's also some leading and trailing text that I need to remove.

x <- "Please enter an UT location before booking the order.. ADDRESS_VALIDATION_FAILED (SITE_TYPE=uct) (SITE_USE_ID=1000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=2000) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=) The task is raised for line_number: 7"

However, in R, when I try to use tidyr's separate function I end up with undesirable results. Am I not escaping it right?

Here was my code for it:

df.sub <- separate(
  data = main.data,
  col = Order.Task.Text.CCW,
  into = c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME", "CUSTOMER_NUMBER",
           "ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE",
           "ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME"),
  sep = "=([^\\)]+|())\\)"
)

Example of Results:

   SITE_TYPE    SITE_USE_ID   CUSTOMER_NAME      CUSTOMER_NUMBER       
1  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
2  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
3  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
4  (SITE_TYPE    (SITE_USE_ID    (CUSTOMER_NAME  (CUSTOMER_NUMBER
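
One way to see where these results come from: separate() interprets a character sep as a regular expression that marks the delimiter, so the text matched by the pattern is discarded and only the pieces between matches are kept. The base R sketch below uses strsplit() as a stand-in for that splitting behaviour, with a shortened, made-up row and a simplified version of the pattern. It is an illustration under those assumptions, not tidyr's actual implementation.

```r
# Shortened, hypothetical input in the same "(KEY=value)" shape as the real rows
row <- "(SITE_TYPE=uct) (ADDRESS4=) (CITY=.)"

# Simplified form of the pattern: "=" up to and including the next ")"
pieces <- strsplit(row, "=[^)]*\\)")[[1]]

# The values were part of the delimiter, so only the key fragments survive
print(pieces)
```

Each value sits inside the delimiter match, which is why every column ends up holding a fragment such as "(SITE_TYPE" rather than the value itself.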

Final Solution

Here's my final solution for anyone curious, based on the correct answer and formatted for ease of viewing.

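library(gsubfn)    # gsubfn() and proto()
library(magrittr)  # %>%

p <- proto(
  pre = function(.) .$k <- 0,
  fun = function(., x) {
    if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
    if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
  })

df.sub.final <- df.sub$text %>%
  sub("^[^\\(]*\\(", "(", .) %>%    # drop text before the first "("
  sub("\\)[^\\)]*$", ")", .) %>%    # drop text after the last ")"
  gsub("\n", "", .) %>%             # remove newlines embedded in the input
  gsub("=", ": ", .) %>%            # KEY=value becomes KEY: value
  gsubfn("([\\(\\)]) *", p, .) %>%  # outer ( is dropped, outer ) becomes a newline
  textConnection %>%
  read.dcf %>%
  as.data.frame(.)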

There seems to be some uncertainty as to what the valid inputs are. Below are several different answers based on different assumptions. All convert the input to dcf form (i.e. name: value) and then use read.dcf.
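
As a minimal sketch of the target format, the lines below are already in name: value form and are read with read.dcf from base R. The field values are made up for illustration, and the vector name dcf_lines is hypothetical.

```r
# Hypothetical lines already in "name: value" form; ADDRESS4 has an empty value
dcf_lines <- c("SITE_TYPE: uct", "ADDRESS4: ", "CITY: .")

con <- textConnection(dcf_lines)
m <- read.dcf(con)   # character matrix with one column per field name
close(con)

print(m)
```

Every field name becomes a column, including fields whose value is empty, which is what the question asks for.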

1) Balanced parentheses.

Transform to dcf form (i.e. name: value).

We can handle balanced parentheses with gsubfn. First create a proto object whose pre function initializes a counter k to zero. Then, for each match of ( or ), the function fun receives the match, increments or decrements k, and returns the appropriate replacement: the outermost ( is dropped, the outermost ) becomes a newline, and nested parentheses are left as they are. See the gsubfn package vignette for more info.

Now, starting from x, strip the junk at the beginning, replace = with : and a space, and then run gsubfn, matching ( or ) followed by optional spaces, with the proto object we defined. Finally, read the transformed text using read.dcf.

library(gsubfn)
library(magrittr)

p <- proto(
 pre = function(.) .$k <- 0,
 fun = function(., x) {
  if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
  if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
})

x %>%
  sub("^.*?\\(", "(", .) %>%
  gsub("=", ": ", .) %>%
  gsubfn("([\\(\\)]) *", p, .) %>%
  textConnection %>%
  read.dcf
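
The same counter idea can be sketched in base R without gsubfn, which may make the logic easier to follow. The helper name outer_parens_to_newlines and the input x_demo are hypothetical, and this is a rough equivalent under the balanced-parentheses assumption rather than a drop-in replacement for the pipeline above.

```r
# Hypothetical helper: drop each outermost "(", turn each outermost ")" into a
# newline, and keep nested parentheses unchanged
outer_parens_to_newlines <- function(s) {
  chars <- strsplit(s, "")[[1]]
  depth <- 0
  out <- chars
  for (i in seq_along(chars)) {
    if (chars[i] == "(") {
      depth <- depth + 1
      if (depth == 1) out[i] <- ""
    } else if (chars[i] == ")") {
      depth <- depth - 1
      if (depth == 0) out[i] <- "\n"
    }
  }
  paste(out, collapse = "")
}

x_demo <- "junk (A=1) (B=x (y) z) (C=) trailing"  # made-up input

txt <- sub("^[^(]*\\(", "(", x_demo)       # strip text before the first "("
txt <- sub("\\)[^)]*$", ")", txt)          # strip text after the last ")"
txt <- gsub("=", ": ", txt, fixed = TRUE)  # KEY=value becomes KEY: value
txt <- outer_parens_to_newlines(txt)
txt <- gsub("\n +", "\n", txt)             # a leading space would mark a continuation line

con <- textConnection(strsplit(txt, "\n", fixed = TRUE)[[1]])
m <- read.dcf(con)
close(con)

print(m)
```

The nested (y) stays inside the value of B because the counter is above 1 when it is opened and above 0 when it is closed.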

2) Nested parentheses have no adjacent spaces

x <- "(SITE_TYPE=Site1) (SITE_USE_ID=2000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=11111) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=)"


library(magrittr)

x %>%
  paste0(" ") %>%
  sub("^.*?\\(", "", .) %>%
  gsub(" +\\(", " ", .) %>%
  gsub("=", ": ", .) %>%
  gsub("\\) ", "\n", .) %>%
  textConnection %>%
  read.dcf

giving:

     SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER ADDRESS1 ADDRESS2
[1,] "Site1"   "2000"      "cname"       "11111"         "addy1"  "addy2" 
     ADDRESS3 ADDRESS4 CITY STATE ZIP    COUNTY COUNTRY CONTACT_NAME
[1,] "addy3"  ""       "."  ""    "0000" ""     "NO"    ""     
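
read.dcf returns a character matrix. If a data frame with one row per input string is wanted, as in the question, one possible approach is to separate the records with blank lines and convert the result afterwards. The sketch below assumes two made-up records, and the name dcf_lines is hypothetical.

```r
# Two made-up records; a blank line separates records in dcf input
dcf_lines <- c(
  "STATE: ", "ZIP: 0000", "COUNTRY: NO",
  "",
  "STATE: na", "ZIP: 250", "COUNTRY: NA"
)

con <- textConnection(dcf_lines)
m <- read.dcf(con)   # one matrix row per record
close(con)

# Keep every column as character so values like "0000" are not altered
df <- as.data.frame(m, stringsAsFactors = FALSE)
print(df)
```

Because the columns stay character, a value such as "NA" remains the literal string rather than a missing value.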

3) Fixed keywords follow the outer left parentheses.

For this one the inner parentheses can be unbalanced, but each outer left parenthesis is always followed by one of the keywords in cn.

x <- "ADDRESS_VALIDATION_FAILED (SITE_TYPE=site1) (SITE_USE_ID=200) (CUSTOMER_NAME=abc) (CUSTOMER_NUMBER=1000) (ADDRESS1=issue here (some more text) (ADDRESS2=) (ADDRESS3=) (ADDRESS4=) (CITY=city, ) (STATE=na) (ZIP=250) (COUNTY=) (COUNTRY=NA) (CONTACT_NAME=)"

cn <- c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME", "CUSTOMER_NUMBER", 
"ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE", 
"ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME")
rx <- sprintf(".(%s)", paste(cn, collapse = "|"))

x %>%
  sub("^.*?\\(", "(", .) %>%
  gsub("=", ": ", .) %>%
  gsub(rx, "\n\\1", .) %>%
  gsub("\\) *\\n", "\n", .) %>%
  sub("\\)$", "", .) %>%
  textConnection %>%
  read.dcf

giving:

     SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER
[1,] "site1"   "200"       "abc"         "1000"         
     ADDRESS1                     ADDRESS2 ADDRESS3 ADDRESS4 CITY    STATE
[1,] "issue here (some more text" ""       ""       ""       "city," "na" 
     ZIP   COUNTY COUNTRY CONTACT_NAME
[1,] "250" ""     "NA"    ""          
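
To make the keyword pattern easier to inspect, the sketch below builds the same kind of regular expression from a two-element stand-in for cn and applies it to a short made-up string. The names cn_demo and rx_demo are hypothetical.

```r
cn_demo <- c("A", "B")  # stand-in for the real keyword vector
rx_demo <- sprintf(".(%s)", paste(cn_demo, collapse = "|"))
print(rx_demo)

# Any single character followed by a keyword is replaced by a newline plus the keyword
out <- gsub(rx_demo, "\n\\1", "(A: 1) (B: 2)")
cat(out)
```

Since the leading . matches any character and not only (, a keyword that happens to occur inside a value would presumably be split as well, so this approach relies on the keywords not appearing in the data.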

Note

The input in reproducible form is:
