Not match Regex in between two character sequences with `String.split`
I'm using Scala to work with some very messy data that is not practical to clean. It comes in the form of delimited key-value pairs, something like this:
"a=1, b=2, c=3"
I am using `String.split` to break up the String into key-value pairs. Most of the string values in these pairs are quoted if they need to be, so this works to avoid matching `, ` inside of quotes:
<string-instance>.split(", (?=(?:[^\\"]*\\"[^\\"]*\\")*[^\\"]*$)")
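For readers who want to try the quote-aware split in isolation, here is a minimal Java sketch of the same lookahead idea. The class and method names (`QuoteAwareSplit`, `splitPairs`) are made up for illustration, and the pattern is written with single-level escaping for a Java string literal, which is an assumption about what the expression above is meant to be.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original question.
final class QuoteAwareSplit {
    // Split on ", " only when the rest of the line holds an even number of
    // double quotes, i.e. the separator is not inside a quoted value.
    static final String OUTSIDE_QUOTES = ", (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> splitPairs(String line) {
        return Arrays.asList(line.split(OUTSIDE_QUOTES));
    }

    public static void main(String[] args) {
        // The comma inside the quoted value is left alone.
        splitPairs("a=1, b=\"x, y\", c=3").forEach(System.out::println);
    }
}
```

This handles quoted values, but it has no way to know that an unquoted value contains a comma-space, so a line with an unquoted URL is still split in too many places.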
However, I have come across a `url` field that is neither quoted nor in all cases URL-encoded, so I have to deal with something like this:
"foo=bar, url=http://city.com/Boston, MA US, is_test=false"
In this case, I'm trying to match the comma-space after `bar` and the one after `US`, and ignore the one after `Boston`. Fortunately, I can rely on these bad cases falling in between `url=` and `, is_test=` everywhere they occur (and that's about it). I've been banging my head on the Java regex tester here: https://www.freeformatter.com/java-regex-tester.html and failing. The closest I could get with the above input was this:
(?<!url=[.]{0,300}^, is_test), (?!.*, is_test)
That only matched the comma-space after `US`, not the one after `bar`. The `{0,300}` is there to work around Java regex not being able to handle potentially infinite look-behind expressions:
java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length
How can I solve this? Ideally, I could OR the expression with the one that ignores quoted comma-spaces. Another possibility would be to match the whitespace in between `url=` and `, is_test` and replace it with `%20`. Unfortunately, the closest I got with that regex was
(?<=url=.{0,300})\\s(?!^\\w*, is_test)
which also matched the whitespace right before `is_test`, which I don't want to touch.
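The `%20` idea can also be approached without a look-behind. The sketch below captures the whole `url` value first and then encodes the spaces inside it. It assumes, as stated above, that the value always ends at `, is_test=`. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question.
final class UrlSpaceEncoder {
    // Group 1 is the "url=" prefix, group 2 is everything up to (but not
    // including) the ", is_test=" that follows it.
    private static final Pattern URL_VALUE = Pattern.compile("(url=)(.*?)(?=, is_test=)");

    static String encodeUrlSpaces(String line) {
        Matcher m = URL_VALUE.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String encoded = m.group(1) + m.group(2).replace(" ", "%20");
            // quoteReplacement keeps any '$' or '\' in the URL literal.
            m.appendReplacement(out, Matcher.quoteReplacement(encoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeUrlSpaces(
            "foo=bar, url=http://city.com/Boston, MA US, is_test=false"));
    }
}
```

Because the lookahead stops before `, is_test=`, the space in front of `is_test` is not touched, and the existing split on `, ` no longer sees a comma-space inside the URL.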
==edit==
My first example did not include a query string with a `=`, which is a major part of my problem. Here is a more complete example of what I am dealing with:
foo="bar, harbor", url=http://city.com/start_city=Boston, MA US&end_city=New York, NY US, is_test=false
Since each key is separated from its value by `=` and the pairs are separated by a comma and some whitespace, you can split on every comma that is immediately followed (after optional whitespace) by a key and a `=` character, using this regex:
,\s*(?=\w+=)
This Java code splits your string at the desired positions:
import java.util.Arrays;
// Split on commas that are followed by optional whitespace and a `key=` token.
String[] data = "foo=\"bar, harbor\", url=http://city.com/start_city=Boston, MA US&end_city=New York, NY US, is_test=false".split(",\\s*(?=\\w+=)");
Arrays.stream(data).forEach(System.out::println);
It prints:
foo="bar, harbor"
url=http://city.com/start_city=Boston, MA US&end_city=New York, NY US
is_test=false
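If the pairs are needed as a lookup table rather than as raw strings, the same split can feed a map. This is only a sketch under the assumption that every key consists of word characters and that the first `=` in each piece separates key from value. `KeyValueParser` and `parse` are illustrative names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper built on the split regex from the answer.
final class KeyValueParser {
    // A comma, optional whitespace, then a lookahead for "key=".
    static final String BEFORE_KEY = ",\\s*(?=\\w+=)";

    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps input order
        for (String pair : line.split(BEFORE_KEY)) {
            // Limit 2: only the first '=' separates key from value, so any
            // '=' inside a URL query string stays in the value.
            String[] kv = pair.split("=", 2);
            fields.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "foo=\"bar, harbor\", url=http://city.com/start_city=Boston,"
            + " MA US&end_city=New York, NY US, is_test=false";
        parse(line).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

One caveat: a value that itself contains a comma followed by `word=` (inside quotes or inside a query string) would still be split at that point, so this relies on the data never having that shape.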
Let me know if this works for your cases; if not, please add the cases for which it doesn't work.