
How to parse a file with keyword-value pairs, {} and line breaks in Java?

In a file I have some variables stored like this:

author = {Some Author},
link = {some link},
text = { bla bla bla bla bla bla bla bla bla bla bla
bla bla bla bla bla bla bla bla bla bla bla bla bla},
...

Some of the variables span multiple lines.

After that I need to split every String entry into key and value, but that's not a problem. So far I have:

\\S+\\s*[=][{]\\s*\\S*[},]

The solutions that work fine for me are:

(\w+)\s*=\s*\{(.*?)\}

and

\\S+\\s*[=]\\s*[{].*[},]

It's not obvious from your post, but this looks like a BibTeX file. If it is, then braces can occur within braces, meaning your language is not "regular" and cannot be described by regular expressions such as the one you provide.

If not, then you want something like

(\w+)\s*=\s*\{(.*?)\}

but writing a parser is probably the most respectable way to solve your problem. If it is BibTeX you are parsing, an open source Java bibliography manager (such as JabRef) might give you some ideas on building something more robust.
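To see why nesting matters, here is a minimal sketch that applies the lazy pattern `(\w+)\s*=\s*\{(.*?)\}` to a flat value and to a value with inner braces. The class name `NestedBraceDemo`, the helper `firstValue` and the sample strings are made up for illustration, and `Pattern.DOTALL` is added so that `.` also matches line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question or answers.
class NestedBraceDemo {
    // The lazy pattern from the answer, written as a Java string literal.
    static final Pattern ENTRY =
            Pattern.compile("(\\w+)\\s*=\\s*\\{(.*?)\\}", Pattern.DOTALL);

    // Returns the captured value of the first entry, or null if nothing matches.
    static String firstValue(String input) {
        Matcher m = ENTRY.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Flat value: the whole text between the braces is captured.
        System.out.println(firstValue("title = {A flat title},"));
        // Nested value: the lazy quantifier stops at the first '}', cutting the value short.
        System.out.println(firstValue("title = {The {GNU} Manual},"));
    }
}
```

The second call yields `The {GNU`, because the pattern has no way to count how many braces are still open.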

I would recommend that you not use regexes for this, since it seems your format is a bit too free-form. Writing a simple parser that first reads a string up to the = as a key, and then reads the insides of the braces up to the separating comma or end-of-file without caring about newlines, would seem a simpler approach to me. If you need to, you can replace the newlines with spaces as you go. It also has the benefit that if your values can contain braces, suitably escaped, it is simpler to handle them with an actual parser than with regexes.
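A minimal sketch of such a hand-written parser might look like the following. It is not the answerer's code: the class name `BraceEntryParser` is hypothetical, and it assumes that inner braces are balanced (counted by depth) rather than backslash-escaped, that keys contain no `=`, and that line breaks inside a value should be folded into single spaces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical hand-written parser for "key = {value}," entries.
class BraceEntryParser {

    static Map<String, String> parse(String input) {
        Map<String, String> entries = new LinkedHashMap<>();
        int pos = 0;
        while (true) {
            int eq = input.indexOf('=', pos);
            if (eq < 0) break; // no further entries
            String key = input.substring(pos, eq).trim();

            int open = input.indexOf('{', eq);
            if (open < 0) throw new IllegalArgumentException("missing '{' after key " + key);

            // Scan to the matching '}', counting nested braces (assumed balanced).
            int depth = 1;
            int i = open + 1;
            while (i < input.length() && depth > 0) {
                char c = input.charAt(i);
                if (c == '{') depth++;
                else if (c == '}') depth--;
                i++;
            }
            if (depth != 0) throw new IllegalArgumentException("unbalanced braces in value of " + key);

            // i now points just past the matching '}'.
            String value = input.substring(open + 1, i - 1)
                    .replaceAll("\\s*\\R\\s*", " ") // fold line breaks into single spaces
                    .trim();
            entries.put(key, value);

            // Skip the optional separating comma and any whitespace.
            pos = i;
            while (pos < input.length()
                    && (input.charAt(pos) == ',' || Character.isWhitespace(input.charAt(pos)))) {
                pos++;
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        String input = "author = {Some Author},\n"
                + "text = { bla {nested} bla\n"
                + "bla bla},\n";
        parse(input).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Handling escaped braces instead of nested ones would only change the inner scanning loop, which is the main advantage over a single regex.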

This format seems simple enough, and unlikely enough to be extended much, that a hand-written parser is quite suitable. But for a more complex language, or even if you just want the exercise, you could use a parser generator to build your parser, which has the benefit of a much more comprehensible language definition. I understand ANTLR is a popular one to use in Java.

You could use the String class's split method.

public String[] split(String regex)

Splits this string around matches of the given regular expression.

You could first split the input at the commas, then split the text between {} by whitespace ( \\s ).
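A rough sketch of the split-based idea is shown below, with one deliberate change: splitting on a bare comma would break any value that itself contains a comma, so this version splits on a closing brace followed by an optional comma instead. The class name `SplitDemo` is hypothetical, and the sketch assumes values contain no `}`. It collapses whitespace runs inside a value into single spaces; calling `value.split("\\s+")` instead would give the individual words.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo of parsing with String.split only.
class SplitDemo {

    static Map<String, String> parse(String input) {
        Map<String, String> entries = new LinkedHashMap<>();
        // Each entry ends with '}' and an optional comma.
        for (String entry : input.split("\\}\\s*,?\\s*")) {
            // Separate key and value at the first "= {".
            String[] kv = entry.split("\\s*=\\s*\\{", 2);
            if (kv.length != 2) continue; // skip anything that is not "key = {value"
            String value = kv[1].trim().replaceAll("\\s+", " ");
            entries.put(kv[0].trim(), value);
        }
        return entries;
    }

    public static void main(String[] args) {
        String input = "author = {Some Author},\n"
                + "link = {some link},\n"
                + "text = { bla bla\n"
                + "bla bla},\n";
        parse(input).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```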

Have you considered Java properties files? http://en.wikipedia.org/wiki/.properties

You should use Properties; a regex is not a good solution in your case.
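For comparison, here is a small sketch of what the same data could look like as a properties file, loaded with `java.util.Properties`. This assumes you are free to change the file format: properties files use no braces or trailing commas, and a value continues on the next line when the line ends with a backslash. The class name `PropertiesDemo` and the sample text are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo; the data is inlined here instead of being read from a file.
class PropertiesDemo {

    static Properties load(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        String text = "author = Some Author\n"
                + "link = some link\n"
                + "text = bla bla bla \\\n"   // trailing backslash continues the value
                + "    bla bla bla\n";        // leading whitespace here is dropped
        Properties props = load(text);
        System.out.println(props.getProperty("author"));
        System.out.println(props.getProperty("text"));
    }
}
```

With a real file you would pass a `FileReader` (or an `InputStream`) to `load` instead of the `StringReader`.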

Using a different file format will probably save you some headaches, but you could parse it like this:

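import java.util.regex.Matcher;
import java.util.regex.Pattern;

// DOTALL lets '.' match line breaks, so multi-line values are captured too.
Pattern p = Pattern.compile("\\s*(\\w+)\\s*=\\s*\\{(.*?)\\},?\\s*", Pattern.DOTALL);
// One Matcher walks the whole input; find() resumes after the previous match.
Matcher m = p.matcher(input);
while (m.find()) {
    String key = m.group(1);
    String val = m.group(2);
    System.out.println("OK: key=" + key + ", val=" + val);
}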

Just replace the println with insertion into your Map.

I'm not sure exactly what you're asking, and your regex isn't much help in providing additional information.

However, if brackets can't nest and you don't want to handle escaped brackets, then the regex is pretty straightforward.

Note: even your most recent regex (you probably should have just edited your post instead of responding to yourself), \\S+\\s*[=]\\s*[{].*[},], is doing some things it doesn't need to, and they will certainly mess you up. The over-use of [] style character classes is probably confusing you. Your last [},] really says "a single character that is either '}' or ','", which I'm pretty sure is not what you mean.

Regex seems to be everyone's favorite whipping boy, but I think it's appropriate here.

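import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The literal braces are escaped: java.util.regex rejects a bare '{' here as an illegal repetition.
Pattern p = Pattern.compile( "\\s*([^={}]+)\\s*=\\s*\\{([^}]+)\\},?" );
Matcher m = p.matcher( someString );
while( m.find() ) {
    // group(1) can include the whitespace before '=', so trim it
    System.out.println( "name:" + m.group(1).trim() + " value:" + m.group(2) );
}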

The regex breaks down as:

  • Any preceding whitespace.
  • First capture group: a non-zero length string containing only characters that are NOT '=', '{', or '}'
  • Any intermediate whitespace.
  • '='
  • Any intermediate whitespace.
  • '{'
  • Second capture group: a non-zero length string containing only characters that are not the closing '}'
  • '}'
  • Optional ','

This regex should perform more efficiently than the .* versions because it is easier for it to figure out where to stop. I also think it is clearer, but I speak regex conversationally. :)
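As a self-contained sketch, the negated-class pattern can be wrapped in a method that fills a Map. The class name `NegatedClassDemo` and the sample input are hypothetical, and the braces are escaped in the pattern as `java.util.regex` requires. Because `[^}]` also matches line breaks, multi-line values are captured without `Pattern.DOTALL`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the negated-character-class pattern.
class NegatedClassDemo {
    private static final Pattern ENTRY =
            Pattern.compile("\\s*([^={}]+)\\s*=\\s*\\{([^}]+)\\},?");

    static Map<String, String> parse(String input) {
        Map<String, String> entries = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            // Group 1 may carry the whitespace before '=', so both groups are trimmed.
            entries.put(m.group(1).trim(), m.group(2).trim());
        }
        return entries;
    }

    public static void main(String[] args) {
        String input = "author = {Some Author},\n"
                + "text = { bla bla\n"
                + "bla bla},\n";
        parse(input).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

The line break inside the `text` value is kept as-is; replace it with a space afterwards if you want single-line values.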
