
Parsing complex string using regex

My regex skills are not very good, and recently a new data element has thrown my parser for a loop.

Take the following string:

"+USER=Bob Smith-GROUP=Admin+FUNCTION=Read/FUNCTION=Write"

Previously I had the following for my regex: [+\-/]

Which would turn the result into:

USER=Bob Smith
GROUP=Admin
FUNCTION=Read
FUNCTION=Write

But now I have values with dashes in them, which is causing bad output.

The new string looks like "+USER=Bob Smith-GROUP=Admin+FUNCTION=Read/FUNCTION=Write/FUNCTION=Read-Write"

Which gives me the following result, and breaks the key=value structure:

USER=Bob Smith
GROUP=Admin
FUNCTION=Read
FUNCTION=Write
FUNCTION=Read
Write
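
To make the failure easy to reproduce, here is a minimal sketch that assumes the parser simply calls String.split with the delimiter class (the question does not show the actual code, so the class and method names are hypothetical). The hyphen inside Read-Write is treated like any other delimiter, which is what produces the orphaned "Write" token.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSplitDemo {
    // Splits on every '+', '-' or '/' and drops empty tokens
    // (the input starts with '+', so split() yields a leading empty string).
    static List<String> naiveSplit(String input) {
        List<String> parts = new ArrayList<>();
        for (String token : input.split("[+\\-/]")) {
            if (!token.isEmpty()) {
                parts.add(token);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String input =
            "+USER=Bob Smith-GROUP=Admin+FUNCTION=Read/FUNCTION=Write/FUNCTION=Read-Write";
        // The last value is cut in two: "FUNCTION=Read" and "Write".
        naiveSplit(input).forEach(System.out::println);
    }
}
```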

Can someone help me formulate a valid regex for handling this, or point me to some key/value examples? Basically, I need to be able to handle the +, - and / signs in order to get the combinations.

You didn't specify which regex engine you're using, but this works if you've got lookahead/lookbehind.

It works on the premise that the keys are uppercase only, whilst the values aren't. I'm not sure that's a valid assumption, but if it isn't, things will get complicated and messy.

(?<=[+\-/])[A-Z]+=(?:(?![A-Z]+=)[^=])+(?=[+\-/]|$)


And here's my attempt to explain that (not sure how much sense this makes):

(?x)         # enable regex comment mode
(?<=[+\-/])  # start with one of the delimiters, but excluded from match
[A-Z]+       # match one or more uppercase (for the key)
=            # match the equal sign

(?:          # start non-capturing group

  (?!          # start negative lookahead, to prevent keys matching
    [A-Z]+=      # a key and equals (since in negative lookahead, this is what we exclude)
  )            # end the negative lookahead
  [^=]         # match a character that's not =

)+           # end non-capturing group, match one or more times, until...

(?=[+\-/]|$) # next char must be delimiter or end of line for match to succeed
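
As a sketch of how an annotated pattern like this can be used directly, Java accepts the same layout when the pattern is compiled with Pattern.COMMENTS (whitespace is ignored and # starts a comment). The class and method names below are hypothetical, and the delimiter class is written as [+\-/] so that the hyphen is a literal rather than part of a range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedRegexDemo {
    // Each line of the pattern ends with "\n" so the '#' comments terminate.
    static final Pattern PAIR = Pattern.compile(
          "(?<=[+\\-/])   # preceded by a delimiter (not part of the match)\n"
        + "[A-Z]+=        # an uppercase key and the equals sign\n"
        + "(?:            # value: consume one character at a time...\n"
        + "  (?![A-Z]+=)  # ...as long as no new key starts here\n"
        + "  [^=]\n"
        + ")+\n"
        + "(?=[+\\-/]|$)  # followed by a delimiter or the end of input\n",
        Pattern.COMMENTS);

    static List<String> findPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.add(m.group());
        }
        return pairs;
    }

    public static void main(String[] args) {
        findPairs("+USER=Bob Smith-GROUP=Admin+FUNCTION=Read-Write")
            .forEach(System.out::println);
    }
}
```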


For a Java string literal, the backslashes need escaping (as would quotes, if there were any):

Pattern p = Pattern.compile("(?<=[+\\-/])[A-Z]+=(?:(?![A-Z]+=)[^=])+(?=[+\\-/]|$)");


And if capturing groups are needed, just add parens round the appropriate parts:

Pattern p = Pattern.compile("(?<=[+\\-/])([A-Z]+)=((?:(?![A-Z]+=)[^=])+)(?=[+\\-/]|$)");


The matching part of this, to turn it into newline-delimited text, is something like:

Matcher m = p.matcher(inputText);
StringBuilder result = new StringBuilder();

// Append each key=value match on its own line
while (m.find())
{
   result.append(m.group()).append("\n");
}
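
Putting the capturing-group pattern and the matching loop together, a self-contained sketch might look like the following. The class name, the parse method and the String[] pair representation are illustrative choices, not part of the original answer, and the uppercase-only key assumption still applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueParser {
    // Group 1 is the key, group 2 the value. Keys are assumed to be
    // uppercase-only; a value stops where a new KEY= begins or at the end.
    private static final Pattern PAIR = Pattern.compile(
        "(?<=[+\\-/])([A-Z]+)=((?:(?![A-Z]+=)[^=])+)(?=[+\\-/]|$)");

    static List<String[]> parse(String input) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.add(new String[] { m.group(1), m.group(2) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input =
            "+USER=Bob Smith-GROUP=Admin+FUNCTION=Read/FUNCTION=Write/FUNCTION=Read-Write";
        for (String[] pair : parse(input)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A list of pairs is used rather than a Map because FUNCTION appears more than once in the sample input.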

Based on your second example, this regex: (\w+)=([\w\s]+(?:-[\w\s]+(?![\w\s]*=))*) returns these results:

USER=Bob Smith
GROUP=Admin
FUNCTION=Read
FUNCTION=Write
FUNCTION=Read-Write

A hyphen is accepted inside a value only when the text after it is not itself followed by =, so the hyphen before GROUP still ends the USER value. (Simply listing - in the character class is not enough: [\w|-|\s] reads |-| as a range containing only |, and a class that did include the hyphen would swallow "-GROUP" into the USER value.)

The parentheses provide groupings for each element, so each match will contain two groups: the first will have the part before the = (so USER, GROUP, FUNCTION) and the second will have the value (Bob Smith, Admin, Read, Write, Read-Write).

You can also name the groups if that would make it easier:

(?<key>\w+)=(?<value>[\w\s]+(?:-[\w\s]+(?![\w\s]*=))*)

Or if you don't care about the groups, you can drop the capturing parens altogether:

\w+=[\w\s]+(?:-[\w\s]+(?![\w\s]*=))*
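
For the named-group variant, a minimal Java sketch could look like this (java.util.regex supports (?<name>...) since Java 7). The group names key and value and the class name are hypothetical. The pattern used here admits a hyphen inside a value only when the text after the hyphen is not itself followed by =, which is an assumption about how the data is shaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // "key" is a run of word characters; "value" is word/space characters,
    // optionally continued by "-more" as long as "more" is not the next key.
    private static final Pattern PAIR = Pattern.compile(
        "(?<key>\\w+)=(?<value>[\\w\\s]+(?:-[\\w\\s]+(?![\\w\\s]*=))*)");

    static List<String> parse(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // Groups are read back by name instead of by index.
            pairs.add(m.group("key") + "=" + m.group("value"));
        }
        return pairs;
    }

    public static void main(String[] args) {
        parse("+USER=Bob Smith-GROUP=Admin+FUNCTION=Read/FUNCTION=Write/FUNCTION=Read-Write")
            .forEach(System.out::println);
    }
}
```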

Another option, if you've got a limited set of keys: you could just match:

(?<=[+\-/])(USER|GROUP|FUNCTION)=[^=]+(?=$|[+\-/](?:USER|GROUP|FUNCTION))


Which in Java I'd probably implement like this:

String key = "USER|GROUP|FUNCTION";
String delim = "[+\\-/]";
Pattern p = Pattern.compile("(?<=" + delim + ")(" + key + ")=[^=]+(?=$|" + delim + "(?:" + key + "))");

This relies on, for example, "Write" not being a valid key (if you can force keys to a single case, such as "write" or "WRITE", then a mixed-case value like "Write" can never be mistaken for one).
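
A runnable sketch of the fixed-key approach follows. It builds the alternation from a list of keys instead of a hand-written string and quotes each key with Pattern.quote; the lookahead also requires the = after the next key. Both are small tightenings added here for illustration, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KnownKeysParser {
    private static final String DELIM = "[+\\-/]";

    // Builds the pattern from the allowed keys; each key is quoted so that
    // any regex metacharacters in a key name are treated literally.
    static Pattern forKeys(String... keys) {
        String alternation = Arrays.stream(keys)
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        return Pattern.compile(
            "(?<=" + DELIM + ")(?:" + alternation + ")=[^=]+"
            + "(?=$|" + DELIM + "(?:" + alternation + ")=)");
    }

    static List<String> parse(Pattern p, String input) {
        List<String> fields = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            fields.add(m.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        Pattern p = forKeys("USER", "GROUP", "FUNCTION");
        parse(p, "+USER=Bob Smith-GROUP=Admin+FUNCTION=Read/FUNCTION=Read-Write")
            .forEach(System.out::println);
    }
}
```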


If you're delimiting fields with characters that can appear in values, you're screwed.

Suppose you receive a string like:

one=a-two=b-three=c-d-four=e

Should that parse into this?

one=a
two=b
three=c-d
four=e

Or should it parse into this?

one=a
two=b
three=c
d-four=e

How do you know? What's your basis for deciding this?
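
To make the ambiguity concrete, the sketch below applies two different but equally defensible rules to that string. Both rules and all names are hypothetical: the first assumes hyphens may occur in values but not in keys, the second assumes the opposite. Each produces one of the two parses, and nothing in the input itself says which is right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmbiguityDemo {
    // Rule A: a hyphen separates fields only when a hyphen-free key and '='
    // follow it, so hyphens may appear inside values.
    static List<String> hyphenInValue(String input) {
        return Arrays.asList(input.split("-(?=[^-=]+=)"));
    }

    // Rule B: a value ends at the first hyphen, so hyphens may appear
    // inside keys instead.
    static List<String> hyphenInKey(String input) {
        List<String> fields = new ArrayList<>();
        Matcher m = Pattern.compile("([^=]+)=([^-=]*)-?").matcher(input);
        while (m.find()) {
            fields.add(m.group(1) + "=" + m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String input = "one=a-two=b-three=c-d-four=e";
        System.out.println(hyphenInValue(input)); // third field is three=c-d
        System.out.println(hyphenInKey(input));   // last field is d-four=e
    }
}
```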
