
Java Regex, capturing groups with comma separated values

Original: 2010-02-18 08:47:54 | Tags: java, regex, csv, capturing-group

InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .

ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries

Generalized pattern tried:

    ".[\\s]?(\\w+?)"+                       // bruises.
    "(?:(\\s)?,(\\s)?(\\w+?))*"+            // wounds marks dislocations
    "[\\s]?(?:or|and) other (\\w+).";       // Injuries

The pattern should also be able to match other input strings, such as: A soldier may have bruises or other injuries that hurt him.

On trying the generalized pattern above, the output is: bruises dislocations Injuries

There is something wrong with the capturing group in "(?:(\\s)?,(\\s)?(\\w+?))*". The group matches more than once, but it returns only "dislocations". "wounds" and "marks" are devoured.

Could you please suggest what the right pattern should be, and where the mistake is? Another question comes closest to this one, but its solution didn't help.

Thanks.

3 Answers

Regex is not suited for (natural) language processing. With regex, you can only match well-defined patterns. You should really, really abandon the idea of doing this with regex.

You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.

EDIT

PSpeed posted a promising link to a 3rd party library, GATE, that's able to do many language processing tasks. And it's written in Java. I have not used it myself, but looking at the people/institutions working on it, it seems pretty solid.

The pattern that works is \\w+(?:\\s*,\\s*\\w+)* and then you separate the matched comma-separated values manually. Java regex keeps only the last capture of a repeated group, so there is no way to get every item from a single match.
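A minimal sketch of that two-step approach, to make the "match, then split" idea concrete. The class and method names (CommaListSplit, firstCommaList) are made up for illustration. Because the pattern also matches single words, the sketch assumes you want the first match that actually contains a comma.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class CommaListSplit {
    // The pattern from the answer: a word followed by zero or more ", word" repetitions.
    private static final Pattern LIST = Pattern.compile("\\w+(?:\\s*,\\s*\\w+)*");

    // Returns the items of the first comma-separated run in the text, or an empty list.
    static List<String> firstCommaList(String text) {
        Matcher m = LIST.matcher(text);
        while (m.find()) {
            String run = m.group();
            // Single words match the pattern too, so skip matches without a comma.
            if (run.indexOf(',') >= 0) {
                // Second step: split the whole match on the separators.
                return Arrays.asList(run.split("\\s*,\\s*"));
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        String input = "A soldier may have bruises , wounds , marks , dislocations "
                + "or other Injuries that hurt him .";
        System.out.println(firstCommaList(input));
    }
}
```

For the input string from the question this yields bruises, wounds, marks and dislocations. The trailing "or other Injuries" part is not covered by this pattern and would need a separate step, and a sentence without any comma gives an empty list.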

In general, Java regex is not suitable for NLP. A useful tool for text mining is: gate.ac.uk
Thanks to Bart K. and PSpeed.

When the capture group is annotated with a quantifier [i.e. (foo)*], you will only get the last match. If you want to get all of them, you need to put the quantifier inside the capture and then manually parse out the values. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP.
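The "last match only" behaviour is easy to see in a small sketch. The pattern here is a simplified stand-in for the one in the question, and the names LastCaptureDemo and groupsOf are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class LastCaptureDemo {
    // Group 2 sits inside a repeated group, so each repetition overwrites it.
    private static final Pattern P = Pattern.compile("(\\w+)(?:\\s*,\\s*(\\w+))*");

    // Returns { group 1, group 2 } for a full match, or an empty array otherwise.
    static String[] groupsOf(String list) {
        Matcher m = P.matcher(list);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groupsOf("bruises , wounds , marks , dislocations");
        // Group 1 holds the first word; group 2 holds only the final repetition.
        System.out.println(g[0] + " / " + g[1]);
    }
}
```

Here group 1 is "bruises" and group 2 is "dislocations". "wounds" and "marks" were matched along the way but are no longer retrievable from the Matcher.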

How to fix: (?:(\\s)?,(\\s)?(\\w+?))*

Well, the quantifier basically covers the whole regex in that case, so you might as well drop it and use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words, then that's something like: \\w+(?:\\s*,\\s*\\w+)* Then don't bother with capture groups and just split the whole match.
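A rough sketch of the Matcher.find() route, assuming the quantified group is reduced to a single ", word" unit with one capture group for the word. FindEachItem and itemsAfterCommas are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class FindEachItem {
    // One ", word" unit; the repetition is done by the find() loop, not by a quantifier.
    private static final Pattern ITEM = Pattern.compile(",\\s*(\\w+)");

    // Collects every word that directly follows a comma.
    static List<String> itemsAfterCommas(String text) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(text);
        while (m.find()) {
            items.add(m.group(1)); // group 1 is fresh on every find() call
        }
        return items;
    }

    public static void main(String[] args) {
        String input = "A soldier may have bruises , wounds , marks , dislocations "
                + "or other Injuries that hurt him .";
        System.out.println(itemsAfterCommas(input));
    }
}
```

On the question's input this collects wounds, marks and dislocations. The first item, "bruises", is not preceded by a comma, so it still has to be picked up separately, which is one reason matching the whole list and splitting it is the simpler option.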

And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times, but you have a whole industry of science guys to draw from: http://gate.ac.uk/