
Not match Regex in between two character sequences with `String.split`

I'm using Scala to work with some very messy data that is not practical to clean. It comes in the form of delimited key-value pairs, something like this: "a=1, b=2, c=3". I am using String.split to break the String into key-value pairs. Most of the string values in these pairs are quoted where they need to be, so this works to avoid matching the comma-space inside quotes: <string-instance>.split(", (?=(?:[^\\"]*\\"[^\\"]*\\")*[^\\"]*$)")
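For reference, the quote-aware split described above can be sketched in Java as follows. This is a minimal illustration, not the asker's exact code: the class name `QuotedSplit`, the method `splitPairs`, and the sample input are assumptions, and the regex is written with ordinary Java string escaping for a double quote.

```java
import java.util.Arrays;

public class QuotedSplit {
    // A comma-space counts as a delimiter only if an even number of double
    // quotes remains between it and the end of the input, which means the
    // delimiter itself is not inside a quoted value.
    static final String OUTSIDE_QUOTES = ", (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: splits one line into raw key=value strings.
    static String[] splitPairs(String line) {
        return line.split(OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        // Sample input is an assumption; the comma-space inside "x, y" is kept.
        String line = "a=1, b=\"x, y\", c=3";
        Arrays.stream(splitPairs(line)).forEach(System.out::println);
    }
}
```

This approach only protects quoted values, which is why the unquoted url field described next still breaks it.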

However, I have come across a url field that is neither quoted nor, in all cases, URL-encoded, so I have to deal with something like this:

"foo=bar, url=http://city.com/Boston, MA US, is_test=false"

In this case, I'm trying to match the comma-space after bar and the one after US, and ignore the one after Boston. Fortunately, I can rely on these bad cases falling in between url= and , is_test= everywhere they occur (and that's about it). I've been banging my head on the Java regex tester here: https://www.freeformatter.com/java-regex-tester.html and failing. The closest I could get with the above input was this: (?<!url=[.]{0,300}^, is_test), (?!.*, is_test) , which only matched the comma-space after US, not the one after bar. The {0,300} is there to work around Java Regex not being able to handle potentially unbounded look-behind expressions: java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length

How can I solve this? Ideally, I could OR the new expression with the one that ignores quoted comma-spaces. Another possibility would be to match the spaces in between url= and , is_test and replace them with %20. Unfortunately, for that Regex expression the closest I got was (?<=url=.{0,300})\\s(?!^\\w*, is_test) , which also matched the white-space right before is_test, which I don't want to touch.
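The second idea (encoding the spaces between url= and , is_test=) can be sketched without any look-behind by capturing the url value and rewriting only that group. This is a hedged sketch under the asker's stated assumption that the bad value always sits between url= and , is_test=; the class name `UrlSpaceEncoder` and the method `encodeUrlSpaces` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlSpaceEncoder {
    // Group 1 is the literal "url=", group 2 is the raw value (reluctant),
    // and the look-ahead requires ", is_test=" to follow without consuming it.
    private static final Pattern URL_FIELD =
            Pattern.compile("(url=)(.*?)(?=, is_test=)");

    // Hypothetical helper: replaces spaces inside the url value with %20.
    static String encodeUrlSpaces(String line) {
        Matcher m = URL_FIELD.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String encoded = m.group(2).replace(" ", "%20");
            // quoteReplacement guards against '$' or '\' in the url value.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + encoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "foo=bar, url=http://city.com/Boston, MA US, is_test=false";
        // Prints: foo=bar, url=http://city.com/Boston,%20MA%20US, is_test=false
        System.out.println(encodeUrlSpaces(line));
    }
}
```

Because the space before is_test lies outside group 2, it is left untouched, and the existing comma-space split can then be applied to the result. Note that this pattern would also match a key that merely ends in url=, so it may need a stricter prefix on real data.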

==edit==

My first example did not include a query string with a =, which is a major part of my problem. Here is a more complete example of what I am dealing with:

foo="bar, harbor", url=http://city.com/start_city=Boston, MA US&end_city=New York, NY US, is_test=false

Since each key is separated from its value by = and the pairs are separated by a comma and some whitespace, you can split on every comma that is immediately followed by a key and an = character, using this regex:

,\s*(?=\w+=)

Online Demo

This Java code splits your string at the desired positions:

import java.util.Arrays;

// Split on each comma (plus optional whitespace) that is followed by a key and "=".
String[] data = "foo=\"bar, harbor\", url=http://city.com/start_city=Boston, MA US&end_city=New York, NY US, is_test=false".split(",\\s*(?=\\w+=)");
Arrays.stream(data).forEach(System.out::println);

Prints:

foo="bar, harbor"
url=http://city.com/start_city=Boston, MA US&end_city=New York, NY US
is_test=false
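As a possible next step, the split pieces can be turned into a map by cutting each piece at its first = only, so that a later = inside the URL's query string stays in the value. This is a sketch, not part of the original answer; the class name `PairParser` and the method `parse` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PairParser {
    // Hypothetical helper: parses "k1=v1, k2=v2, ..." into an ordered map.
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        // Same delimiter as above: a comma followed by a key and "=".
        for (String pair : line.split(",\\s*(?=\\w+=)")) {
            // Limit 2 keeps any further '=' characters inside the value.
            String[] kv = pair.split("=", 2);
            result.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "foo=\"bar, harbor\", url=http://city.com/start_city=Boston, MA US"
                + "&end_city=New York, NY US, is_test=false";
        parse(line).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

One caveat worth keeping in mind: a value that itself contains a comma followed directly by word characters and = (for example ", NY=...") would still be split, so this relies on the data not containing that shape.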

Let me know if this works for your cases and, if not, please add the cases for which it doesn't work.
