Separating a column in R using Regex & separate (tidyr)
This is what I would like to be able to do:
https://regex101.com/r/KchccA/1
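I want to match any characters between = and ), while also allowing for an empty capture group, because I want every field to be populated for every row.
Example row: in this example, Address4, County and Contact name are empty. You can also see that some of the values are wrong or invalid. There is also some leading and trailing text that I need to remove.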
x <- "Please enter an UT location before booking the order.. ADDRESS_VALIDATION_FAILED (SITE_TYPE=uct) (SITE_USE_ID=1000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=2000) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=) The task is raised for line_number: 7"
However, in R, when I try to use tidyr's separate function, I end up with bad results. Am I not escaping something correctly?
Here is my code:
df.sub <- separate(data = main.data, col = Order.Task.Text.CCW,
  into = c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME", "CUSTOMER_NUMBER",
           "ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE",
           "ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME"),
  sep = "=([^\\)]+|())\\)")
Example result:
SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER
1 (SITE_TYPE (SITE_USE_ID (CUSTOMER_NAME (CUSTOMER_NUMBER
2 (SITE_TYPE (SITE_USE_ID (CUSTOMER_NAME (CUSTOMER_NUMBER
3 (SITE_TYPE (SITE_USE_ID (CUSTOMER_NAME (CUSTOMER_NUMBER
4 (SITE_TYPE (SITE_USE_ID (CUSTOMER_NAME (CUSTOMER_NUMBER
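A note on why the result above looks this way: the sep argument of separate describes the delimiter between fields, so whatever the regex matches (here the "=value)" part) is discarded and only the text between matches, i.e. the field names, is kept. A minimal sketch of extracting the values instead, using base R only. The variable names (x_demo, pairs, keys, vals, fields) and the shortened input row are made up for illustration, and the pattern assumes that values contain no ")" character.

```r
# Shortened, hypothetical version of the example row
x_demo <- "Please enter an UT location.. ADDRESS_VALIDATION_FAILED (SITE_TYPE=uct) (SITE_USE_ID=1000) (ADDRESS4=) (CITY=.) (ZIP=0000) The task is raised for line_number: 7"

# Pull out every "(NAME=value)" group; [^)]* also matches an empty value
pairs <- regmatches(x_demo, gregexpr("\\([A-Z0-9_]+=[^)]*\\)", x_demo))[[1]]

# Split each group into its name and its (possibly empty) value
keys <- sub("^\\(([A-Z0-9_]+)=.*$", "\\1", pairs)
vals <- sub("^\\([A-Z0-9_]+=(.*)\\)$", "\\1", pairs)

# Named character vector: one element per field, empty fields kept as ""
fields <- setNames(vals, keys)
print(fields)
```

Because the leading and trailing text is never inside a "(NAME=value)" group, it is dropped without a separate cleanup step.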
Here is my final solution, for anyone who is curious. It puts the correct answer into a format that is easy to view.
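library(gsubfn)    # gsubfn(); also attaches the proto package
library(magrittr)  # %>%

# proto object that tracks the parenthesis depth in the counter k
p <- proto(
  pre = function(.) .$k <- 0,
  fun = function(., x) {
    if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
    if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
  })

df.sub.final <- df.sub$text %>%
  sub("^[^\\(]*\\(", "(", .) %>%    # drop the text before the first "("
  sub("\\)[^\\)]*$", ")", .) %>%    # drop the text after the last ")"
  gsub("\n", "", .) %>%             # remove newlines already in the text
  gsub("=", ": ", .) %>%            # NAME=value becomes NAME: value
  gsubfn("([\\(\\)]) *", p, .) %>%  # outer parentheses become record lines
  textConnection %>%
  read.dcf %>%
  as.data.frame(.)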
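There seems to be some uncertainty about what counts as valid input. Here are several different answers based on different assumptions. All of them convert the input to dcf form (i.e. name: value) and then read it with read.dcf.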
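To make the target format concrete, here is a minimal sketch of what read.dcf does with text that is already in name: value form. The dcf_text string is a hand-written, hypothetical stand-in for the transformed input, with only three of the fields.

```r
# One "name: value" pair per line; ADDRESS4 has an empty value
dcf_text <- "SITE_TYPE: uct\nADDRESS4: \nCITY: .\n"

con <- textConnection(dcf_text)
res <- read.dcf(con)  # character matrix with one column per field name
close(con)

print(res)
```

The result is a character matrix, so it can be passed to as.data.frame if a data frame is needed.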
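We can use gsubfn to process the parentheses. First create a proto object whose pre function initializes the counter k to zero. Then, for each match of ( or ), the function fun receives the match, increments or decrements k, and outputs the appropriate replacement character. See the gsubfn package vignette for more information.
Now, starting with x, replace the junk at the beginning with an opening parenthesis, replace = with : followed by a space, and then run gsubfn, matching ( or ) followed by optional spaces, with the proto object we defined. Finally, read in the transformed text using read.dcf.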
library(gsubfn)
library(magrittr)

p <- proto(
  pre = function(.) .$k <- 0,  # reset the depth counter before each string
  fun = function(., x) {
    if (x == "(") .$k <- .$k + 1 else if (x == ")") .$k <- .$k - 1
    # outer "(" is dropped, outer ")" ends the line, inner ones are kept
    if (x == "(" && .$k == 1) "" else if (x == ")" && .$k == 0) "\n" else x
  })

x %>%
  sub("^.*?\\(", "(", .) %>%
  gsub("=", ": ", .) %>%
  gsubfn("([\\(\\)]) *", p, .) %>%
  textConnection %>%
  read.dcf
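For readers without gsubfn, the depth-counting idea behind the proto object can be sketched in base R. This is only an illustration under the same balanced-parentheses assumption, not the method used above; the function name to_dcf and the sample string are hypothetical.

```r
# Turn "(A: 1) (B: x (y) z)" into dcf lines by tracking parenthesis depth
to_dcf <- function(s) {
  # Tokens: a parenthesis plus any spaces after it, or a run of other text
  tokens <- regmatches(s, gregexpr("[()] *|[^()]+", s))[[1]]
  depth <- 0
  out <- character(length(tokens))
  for (i in seq_along(tokens)) {
    ch <- substr(tokens[i], 1, 1)
    if (ch == "(") {
      depth <- depth + 1
      out[i] <- if (depth == 1) "" else "("    # drop only the outer "("
    } else if (ch == ")") {
      depth <- depth - 1
      out[i] <- if (depth == 0) "\n" else ")"  # outer ")" ends the line
    } else {
      out[i] <- tokens[i]
    }
  }
  paste(out, collapse = "")
}

dcf <- to_dcf("(A: 1) (B: x (y) z) (C: )")
cat(dcf)
```

As with the gsubfn pattern, spaces directly after a parenthesis are consumed, so the inner "(y) z" comes out as "(y)z".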
x <- "(SITE_TYPE=Site1) (SITE_USE_ID=2000) (CUSTOMER_NAME=cname) (CUSTOMER_NUMBER=11111) (ADDRESS1=addy1) (ADDRESS2=addy2) (ADDRESS3=addy3) (ADDRESS4=) (CITY=.) (STATE=) (ZIP=0000) (COUNTY=) (COUNTRY=NO) (CONTACT_NAME=)"

library(magrittr)

x %>%
  paste0(" ") %>%
  sub("^.*?\\(", "", .) %>%
  gsub(" +\\(", " ", .) %>%
  gsub("=", ": ", .) %>%
  gsub("\\) ", "\n", .) %>%
  textConnection %>%
  read.dcf
giving:
SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER ADDRESS1 ADDRESS2
[1,] "Site1" "2000" "cname" "11111" "addy1" "addy2"
ADDRESS3 ADDRESS4 CITY STATE ZIP COUNTY COUNTRY CONTACT_NAME
[1,] "addy3" "" "." "" "0000" "" "NO" ""
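For this one, the inner parentheses may be unbalanced, but each outer opening parenthesis is always followed by one of the keywords in cn.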
x <- "ADDRESS_VALIDATION_FAILED (SITE_TYPE=site1) (SITE_USE_ID=200) (CUSTOMER_NAME=abc) (CUSTOMER_NUMBER=1000) (ADDRESS1=issue here (some more text) (ADDRESS2=) (ADDRESS3=) (ADDRESS4=) (CITY=city, ) (STATE=na) (ZIP=250) (COUNTY=) (COUNTRY=NA) (CONTACT_NAME=)"

cn <- c("SITE_TYPE", "SITE_USE_ID", "CUSTOMER_NAME", "CUSTOMER_NUMBER",
        "ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "STATE",
        "ZIP", "COUNTY", "COUNTRY", "CONTACT_NAME")
rx <- sprintf(".(%s)", paste(cn, collapse = "|"))

x %>%
  sub("^.*?\\(", "(", .) %>%
  gsub("=", ": ", .) %>%
  gsub(rx, "\n\\1", .) %>%
  gsub("\\) *\\n", "\n", .) %>%
  sub("\\)$", "", .) %>%
  textConnection %>%
  read.dcf
giving:
SITE_TYPE SITE_USE_ID CUSTOMER_NAME CUSTOMER_NUMBER
[1,] "site1" "200" "abc" "1000"
ADDRESS1 ADDRESS2 ADDRESS3 ADDRESS4 CITY STATE
[1,] "issue here (some more text" "" "" "" "city," "na"
ZIP COUNTY COUNTRY CONTACT_NAME
[1,] "250" "" "NA" ""
The input in reproducible form is: