How to parse a file with keyword-value pairs, {} and line breaks in Java?

In a file I have some variables stored like this:

author = {Some Author},
link = {some link},
text = { bla bla bla bla bla bla bla bla bla bla bla
bla bla bla bla bla bla bla bla bla bla bla bla bla},
...

Some of the variables span multiple lines.

After that I need to split every String entry into key and value, but that's not a problem. So far I have:

\\S+\\s*[=][{]\\s*\\S*[},]

The solutions that work fine for me are:

(\w+)\s*=\s*\{(.*?)\}

and

\\S+\\s*[=]\\s*[{].*[},]

It's not obvious from your post, but this looks like a BibTeX file. If it is, then braces can occur within braces, meaning your language is not "regular" and cannot be described by regular expressions such as the one you provide.

If not, then you want something like

(\w+)\s*=\s*\{(.*?)\}

but writing a parser is probably the most respectable way to solve your problem. If it is BibTeX you are parsing, an open source Java bibliography manager (such as JabRef) might give you some ideas on building something more robust.
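To illustrate the nesting problem, here is a minimal sketch that applies the lazy pattern above to a value containing inner braces. The class name NestedBraceDemo, the helper firstValue, and the sample title are made up for illustration; Pattern.DOTALL is an addition so that '.' also matches line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedBraceDemo {
    // The lazy pattern from the answer; DOTALL is added so '.' also matches line breaks.
    static final Pattern ENTRY =
            Pattern.compile("(\\w+)\\s*=\\s*\\{(.*?)\\}", Pattern.DOTALL);

    /** Returns the value captured for the first entry, or null if nothing matches. */
    static String firstValue(String input) {
        Matcher m = ENTRY.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Flat value: captured as intended.
        System.out.println(firstValue("author = {Some Author},"));
        // Nested braces: the lazy match stops at the first '}', truncating the value.
        System.out.println(firstValue("title = {The {LaTeX} Companion},"));
    }
}
```

The first call prints "Some Author"; the second prints "The {LaTeX", because the match ends at the first closing brace it meets.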

I would recommend that you not use regexes for this, since it seems your format is a bit too free-form. Writing a simple parser that first reads a string up to the = as a key and then reads the insides of the braces up to the separating comma or end-of-file without caring about newlines would, to me, seem a simpler approach. And if you need it to, you can replace the newlines with spaces as you go. It also has the benefit that if your values can contain braces, suitably escaped, it is simpler to handle them with an actual parser than with regexes.
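A rough sketch of such a hand-written parser, under some assumptions: the class name BraceParser is invented, entries always look like key = {value} with an optional trailing comma, nested braces are balanced, and there is no escape syntax for braces (the post does not say how escaping would work, so it is left out).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BraceParser {
    /** Parses entries of the form key = {value}, into an insertion-ordered map. */
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        int pos = 0;
        int n = input.length();
        while (pos < n) {
            // Skip whitespace and separating commas between entries.
            while (pos < n && (Character.isWhitespace(input.charAt(pos))
                    || input.charAt(pos) == ',')) {
                pos++;
            }
            if (pos >= n) break;

            // The key is everything up to the '='.
            int eq = input.indexOf('=', pos);
            if (eq < 0) {
                throw new IllegalArgumentException("Expected '=' after position " + pos);
            }
            String key = input.substring(pos, eq).trim();

            // The value starts at the '{' that follows the '='.
            int open = eq + 1;
            while (open < n && Character.isWhitespace(input.charAt(open))) open++;
            if (open >= n || input.charAt(open) != '{') {
                throw new IllegalArgumentException("Expected '{' after key '" + key + "'");
            }

            // Scan to the matching '}', tracking nesting depth.
            int depth = 1;
            int i = open + 1;
            while (i < n && depth > 0) {
                char c = input.charAt(i);
                if (c == '{') depth++;
                else if (c == '}') depth--;
                i++;
            }
            if (depth != 0) {
                throw new IllegalArgumentException("Unbalanced braces in value of '" + key + "'");
            }

            // i is one past the closing brace; turn line breaks and
            // runs of whitespace into single spaces.
            String value = input.substring(open + 1, i - 1)
                    .replaceAll("\\s+", " ").trim();
            result.put(key, value);
            pos = i;
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "author = {Some Author},\n"
                + "link = {some link},\n"
                + "text = { bla bla\nbla bla},\n"
                + "title = {The {LaTeX} Companion}";
        parse(input).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Counting brace depth is what lets this version keep "The {LaTeX} Companion" intact, which the regex approaches cannot do in general.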

This format seems simple enough, and unlikely enough to be extended much, that a hand-written parser is quite suitable. But for a more complex language, or even if you just want the exercise, you could use a parser generator to build your parser, which has the benefit of a much more comprehensible language definition. I understand ANTLR is a popular one to use in Java.

You could use the String class's split method.

public String[] split(String regex)

Splits this string around matches of the given regular expression.

You could first split the input at each comma, then split the text between {} on whitespace ( \\s ).
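One way the split idea could look, as a naive sketch only. The class name SplitDemo is invented, and the extra split on "=" (to separate key from value) is an addition not spelled out above. It assumes no value contains a comma; a value such as {Doe, John} would be cut in two.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SplitDemo {
    /** Naive split-based parsing; assumes no ',' inside a value and no '=' inside a key. */
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : input.split(",")) {
            if (entry.trim().isEmpty()) continue;   // skip blank pieces
            String[] kv = entry.split("=", 2);      // limit 2: split only at the first '='
            if (kv.length < 2) continue;            // not a key = {value} piece
            String value = kv[1].trim();
            // Strip the surrounding braces if both are present.
            if (value.startsWith("{") && value.endsWith("}")) {
                value = value.substring(1, value.length() - 1);
            }
            // Collapse line breaks and repeated whitespace into single spaces.
            result.put(kv[0].trim(), value.trim().replaceAll("\\s+", " "));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "author = {Some Author},\n"
                + "link = {some link},\n"
                + "text = { bla bla\nbla bla},\n";
        Map<String, String> map = parse(input);
        System.out.println(map);
        // Splitting a value on whitespace yields its individual words.
        String[] words = map.get("text").split("\\s+");
        System.out.println(words.length + " words");
    }
}
```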

Have you considered Java properties files? http://en.wikipedia.org/wiki/.properties

You should use properties; regular expressions are not a good solution in your case.
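For comparison, a small sketch of what the same data might look like as a .properties file read with java.util.Properties. This assumes you are free to change the file format: values lose their braces and trailing commas, and a multi-line value uses a backslash at the end of the line. The class name PropertiesDemo and the sample text are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesDemo {
    /** Loads properties from text; a StringReader stands in for a FileReader here. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // A trailing backslash continues the value on the next line;
        // leading whitespace on the continuation line is dropped.
        String text = "author = Some Author\n"
                + "link = some link\n"
                + "text = bla bla \\\n"
                + "    bla bla\n";
        Properties props = load(text);
        System.out.println(props.getProperty("author"));
        System.out.println(props.getProperty("text"));
    }
}
```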

Using a different file format will probably save you some headaches, but you could parse it like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// DOTALL lets '.' match line breaks, so multi-line values are captured.
Pattern p = Pattern.compile("\\s*(\\w+)\\s*=\\s*\\{(.*?)\\},?\\s*", Pattern.DOTALL);
Matcher m = p.matcher(input);
// Each find() resumes after the previous match, so the input need not be modified.
while (m.find()) {
    String key = m.group(1);
    String val = m.group(2);
    System.out.println("OK: key=" + key + ", val=" + val);
}

Just replace the println with insertion into your Map.
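The Map insertion could be sketched as follows. The class name RegexToMap is invented, a LinkedHashMap is chosen only to keep the entries in file order, and the loop uses one Matcher with repeated find() calls. As with any version of this regex, it assumes braces do not nest.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexToMap {
    // DOTALL lets '.' match line breaks, so multi-line values are captured.
    static final Pattern ENTRY = Pattern.compile(
            "\\s*(\\w+)\\s*=\\s*\\{(.*?)\\},?\\s*", Pattern.DOTALL);

    /** Collects every key = {value} entry into an insertion-ordered map. */
    static Map<String, String> parse(String input) {
        Map<String, String> map = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            // trim() drops padding just inside the braces; inner line breaks are kept.
            map.put(m.group(1), m.group(2).trim());
        }
        return map;
    }

    public static void main(String[] args) {
        String input = "author = {Some Author},\n"
                + "link = {some link},\n"
                + "text = { bla bla\nbla bla},\n";
        parse(input).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```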

I'm not sure exactly what you're asking, and your regex isn't much help in providing additional information.

However, if brackets can't nest and you don't want to handle escaped brackets, then the regex is pretty straightforward.

Note: even your most recent regex, \\S+\\s*[=]\\s*[{].*[},] (you probably should have just edited your post instead of responding to yourself), is doing some things it doesn't need to, and they will certainly mess you up. The overuse of [] style character classes is probably confusing you. Your last [},] really says "one character that is either '}' or ','", which I'm pretty sure is not what you mean.

Regex seems to be everyone's favorite whipping boy but I think it's appropriate here.

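import java.util.regex.Matcher;
import java.util.regex.Pattern;

// '{' must be escaped: unescaped, Java rejects it with "Illegal repetition".
Pattern p = Pattern.compile( "\\s*([^={}]+)\\s*=\\s*\\{([^}]+)\\},?" );
Matcher m = p.matcher( someString );
while( m.find() ) {
    // group(1) keeps any whitespace before '=', so trim it.
    System.out.println( "name:" + m.group(1).trim() + " value:" + m.group(2) );
}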

The regex breaks down as:

  • Any preceding whitespace.
  • First capture group is a non-zero length string containing only characters that are NOT '=', '{', or '}'
  • Any intermediate whitespace.
  • '='
  • Any intermediate whitespace.
  • '{'
  • Second capture group is a non-zero length string containing only characters that are not the closing '}'
  • '}'
  • Optional ','

This regex should perform more efficiently than the .* versions because it is easier for it to figure out where to stop. I also think it is clearer but I speak regex conversationally. :)
