Separating a column in R using Regex & separate (tidyr)
This is what I am looking to be able to do:
https://regex101.com/r/KchccA/1
I want to match any characters between = and ), while also handling an empty captured group, because I want every field to be populated in each row.
Example of a row: in this example Address4, County, and Contact name are empty. You can also see that some fields have wrong or incorrect values. There is also some leading and trailing text that I need to remove.
x <- "Please enter an UT location before booking the order.. ADDRESS_VALIDATION_FAILED (SITE_TYPE=uct) (SITE_USE_ID=1000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=2000) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=) The task is raised for line_number: 7"
However, in R, when I try to use tidyr's separate function I end up with undesirable results. Am I not escaping it right?
Here is my code:
df.sub <- separate(data = main.data, col = Order.Task.Text.CCW, into = c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME","CUSTOMER_NUMBER", "ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE", "ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME"), sep = "=([^\\)]+|())\\)")
Example of results:
SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER
1 (SITE_TYPE (SITE_USE_ID (CUSTOMER_NAME (CUSTOMER_NUMBER
2 (SITE_TYPE (SITE_USE_ID (CUSTOMER_NAME (CUSTOMER_NUMBER
3 (SITE_TYPE (SITE_USE_ID (CUSTOMER_NAME (CUSTOMER_NUMBER
4 (SITE_TYPE (SITE_USE_ID (CUSTOMER_NAME (CUSTOMER_NUMBER
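The output above is consistent with how a separator works: the text matched by sep is thrown away and only the pieces between matches are kept, so the values disappear and the keys remain. Below is a minimal base-R sketch of that behaviour, together with one possible way to extract the pairs instead. The shortened input string and the names pieces, m, keys and vals are made up for illustration; this is not the approach taken in the answer.

```r
# Hypothetical shortened row in the same "(KEY=value)" shape as the question.
x <- "junk (SITE_TYPE=uct) (ADDRESS4=) (CITY=.) trailing text"

# Splitting on the question's pattern discards every "=value)" match,
# so what is left are the keys plus the leading/trailing text.
pieces <- strsplit(x, "=([^\\)]+|())\\)", perl = TRUE)[[1]]

# Extracting instead: find each "(KEY=value)" group, allowing an empty value.
m <- regmatches(x, gregexpr("\\(([A-Z0-9_]+)=([^)]*)\\)", x))[[1]]
keys <- sub("^\\(([A-Z0-9_]+)=.*$", "\\1", m)
vals <- sub("^\\([A-Z0-9_]+=(.*)\\)$", "\\1", m)

# Named character vector: one element per field, empty strings kept.
setNames(vals, keys)
```

This sketch assumes values never contain a closing parenthesis; the answer below discusses inputs where that does not hold.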
Here's my final solution for anyone curious, based on the correct answer and formatted for ease of viewing.
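library(gsubfn)   # provides gsubfn(); proto() comes from its dependency, proto
library(magrittr) # provides %>%
# Proto object whose counter k tracks the parenthesis nesting depth.
p <- proto(
  pre = function(.) .$k <- 0,
  fun = function(., x) {
    if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
    if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
  })
df.sub.final <- df.sub$text %>%
  sub("^[^\\(]*\\(", "(", .) %>%    # drop text before the first (
  sub("\\)[^\\)]*$", ")", .) %>%    # drop text after the last )
  gsub("\n", "", .) %>%             # remove embedded newlines
  gsub("=", ": ", .) %>%            # KEY=value becomes KEY: value
  gsubfn("([\\(\\)]) *", p, .) %>%  # outer ( removed, outer ) becomes newline
  textConnection %>%
  read.dcf %>%
  as.data.frame(.)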
There seems to be some uncertainty as to what the valid inputs are. Below are several different answers based on different assumptions. All convert the input to dcf form (i.e. name: value) and then use read.dcf.
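As a rough illustration of the dcf form mentioned above, here is a small base-R sketch that feeds hand-written name: value lines to read.dcf. The field values and the names dcf_text, con and m are made up for the example.

```r
# Hand-written dcf text: one "name: value" pair per line (hypothetical values).
dcf_text <- "SITE_TYPE: uct\nADDRESS4: \nCITY: ."

# read.dcf reads from a connection, so wrap the string in a textConnection.
con <- textConnection(dcf_text)
m <- read.dcf(con)
close(con)

# m is a character matrix with one row and one column per field name.
m
```

The result is a character matrix, so as.data.frame can be applied afterwards if a data frame is wanted.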
Transform to dcf form (i.e. name: value). We can handle balanced parentheses with gsubfn. First create a proto object whose pre function initializes a counter k to zero; then, for each match of ( or ), the function fun receives the matched character, increments or decrements k, and returns the appropriate replacement character. See the gsubfn package vignette for more info.
Now, starting from x, remove the junk at the beginning, replace = with : followed by a space, and then run gsubfn, matching ( or ) followed by optional spaces, with the proto object we defined. Finally, read the transformed text using read.dcf.
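To make the counter logic easier to follow, here is a base-R re-expression of the same idea as a plain loop over characters. It is only a sketch: strip_outer is a hypothetical helper, not part of gsubfn, and unlike the "([\\(\\)]) *" pattern it does not swallow spaces that follow a parenthesis.

```r
# Drop each outermost "(" and turn each outermost ")" into a newline,
# leaving nested parentheses untouched. k is the current nesting depth.
strip_outer <- function(s) {
  chars <- strsplit(s, "")[[1]]
  k <- 0
  out <- character(length(chars))
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == "(") k <- k + 1 else if (ch == ")") k <- k - 1
    out[i] <- if (ch == "(" && k == 1) "" else if (ch == ")" && k == 0) "\n" else ch
  }
  paste(out, collapse = "")
}

# The nested "(nested)" survives; the outer pairs become line breaks.
res <- strip_outer("(A: x (nested))(B: y)")
cat(res)
```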
library(gsubfn)
library(magrittr)
p <- proto(
  pre = function(.) .$k <- 0,
  fun = function(., x) {
    if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
    if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
  })
x %>%
  sub("^.*?\\(", "(", .) %>%
  gsub("=", ": ", .) %>%
  gsubfn("([\\(\\)]) *", p, .) %>%
  textConnection %>%
  read.dcf
x <- "(SITE_TYPE=Site1) (SITE_USE_ID=2000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=11111) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=)"
library(magrittr)
x %>%
  paste0(" ") %>%
  sub("^.*?\\(", "", .) %>%
  gsub(" +\\(", " ", .) %>%
  gsub("=", ": ", .) %>%
  gsub("\\) ", "\n", .) %>%
  textConnection %>%
  read.dcf
giving:
SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER ADDRESS1 ADDRESS2
[1,] "Site1" "2000" "cname" "11111" "addy1" "addy2"
ADDRESS3 ADDRESS4 CITY STATE ZIP COUNTY COUNTRY CONTACT_NAME
[1,] "addy3" "" "." "" "0000" "" "NO" ""
For this one, the inner parentheses can be unbalanced, but each outer opening parenthesis is always followed by one of the keywords in cn.
x <- "ADDRESS_VALIDATION_FAILED (SITE_TYPE=site1) (SITE_USE_ID=200) (CUSTOMER_NAME=abc) (CUSTOMER_NUMBER=1000) (ADDRESS1=issue here (some more text) (ADDRESS2=) (ADDRESS3=) (ADDRESS4=) (CITY=city, ) (STATE=na) (ZIP=250) (COUNTY=) (COUNTRY=NA) (CONTACT_NAME=)"
cn <- c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME", "CUSTOMER_NUMBER",
        "ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE",
        "ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME")
rx <- sprintf(".(%s)", paste(cn, collapse = "|"))
x %>%
  sub("^.*?\\(", "(", .) %>%
  gsub("=", ": ", .) %>%
  gsub(rx, "\n\\1", .) %>%
  gsub("\\) *\\n", "\n", .) %>%
  sub("\\)$", "", .) %>%
  textConnection %>%
  read.dcf
giving:
SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER
[1,] "site1" "200" "abc" "1000"
ADDRESS1 ADDRESS2 ADDRESS3 ADDRESS4 CITY STATE
[1,] "issue here (some more text" "" "" "" "city," "na"
ZIP COUNTY COUNTRY CONTACT_NAME
[1,] "250" "" "NA" ""
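To see what each step of the keyword-based pipeline does, the same substitutions can be run one at a time in base R on a shortened input. The three-keyword list, the short input string and the intermediate names s1 to s4 are only for illustration.

```r
# Shortened keyword list and input, for illustration only.
cn <- c("ADDRESS1", "ADDRESS2", "CITY")
rx <- sprintf(".(%s)", paste(cn, collapse = "|"))  # ".(ADDRESS1|ADDRESS2|CITY)"
x <- "(ADDRESS1=issue here (some more text) (ADDRESS2=) (CITY=city, )"

s1 <- gsub("=", ": ", x)          # KEY=value becomes KEY: value
s2 <- gsub(rx, "\n\\1", s1)       # the char before each keyword becomes a newline
s3 <- gsub("\\) *\\n", "\n", s2)  # drop the ")" that closed the previous field
s4 <- sub("\\)$", "", s3)         # drop the final ")"

# Each field now sits on its own "name: value" line, ready for read.dcf.
cat(s4)
```

Because fields are split at the keywords instead of at the parentheses, the unbalanced "(some more text" stays inside the ADDRESS1 value.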
The input in reproducible form is: