Regex question: until the next match or the end of the document

Question

I'm working on a document parser in C# to extract data from some documents that I've been given. The documents are in the form:


(Type 1): (potentially multi-lined string)
(Type 2): (potentially multi-lined string)
(Type 3): (potentially multi-lined string)
...
(Type N): (potentially multi-lined string)
(Type 1): (potentially multi-lined string)
...
End Of Document.

The document repeats (Type 1)-(Type N) M times in the same format.

I'm having trouble with the multi-line strings and with matching the last iteration of (Type 1)-(Type N).

What I need to do is capture each (potentially multi-lined string) in a group named after its preceding (Type #).

Here is a snippet of the document that I'm trying to match:

Name: John Dow
Position: VP. over Development
Bio: Here is a really long string of un important stuff
that could include words like "Bio" or "Name".  Some times I have problems
here, but for the most part it should be normal Bio information
Position History: Vp. over Development
Sr. Project Manager
Jr. Project Manager
Developer
Peon
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.
Name: Joe Noob
Position: Peon
Bio: I'm a peon, so I have little bio
Position History: Peon
Notes: few notes
Name: Jane Smith
Position: VP. over Sales
Bio: Here is a really long string of more un important stuff
that could include words like "Bio" or "Name".  Some times I have problems
here, but for the most part it should be normal Bio information
Position History: Vp. over Sales
Sales Manager
Secretary
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.

The order of the (Type #) labels is always the same, and each label is always preceded by a newline.

What I have:

Name:\s(?:(?.*?)\r\n)+?Position:\s(?:(?.*?)\r\n)+?Bio:\s(?:(?.*?)\r\n)+?Position History:\s(?:(?.*?)\r\n)+?Notes:\s(?:(?.*?)\r\n)+?

Any help would be great!

Answer 1

Because you're using lazy matching, the last token takes only as much as it must. You can solve that by adding a lookahead at the end of your pattern, so that the match runs until the next token:

(?=^Name:|$)

Here's the full regex:

Name:\s(?:(.*?)\s+)Position:\s(?:(.*?)\s+)Bio:\s(?:(.*?)\s+)Position History:\s(?:(.*?)\s+)Notes:\s(?:(.*?)\s+)(?=^Name:|$)

Example: http://regexhero.net/tester/?id=92982feb-806f-4d0a-96a3-5ef6689a0e01
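
For readers who want to try the lookahead idea in C#, here is a minimal sketch rather than the answerer's exact code. It assumes `RegexOptions.Multiline` (so `^` matches at the start of each line) and `RegexOptions.Singleline` (so `.` can cross line breaks). It also makes two changes that are assumptions of this sketch: the final `$` is swapped for `\z`, because under `Multiline` a `$` also matches before every `\n` and can cut a multi-line Notes value short when lines end in `\r\n`, and the last `\s+` is relaxed to `\s*` so a document without a trailing newline still matches. The class name `LookaheadParser` and the group names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadParser
{
    // Hypothetical group names; the labels and their order come from the question.
    private static readonly string[] Fields =
        { "Name", "Position", "Bio", "History", "Notes" };

    // Each value is matched lazily up to the next label. The final lookahead
    // stops the Notes value at the next "Name:" line or at the very end of input.
    // Assumption: \z replaces $ and the last \s+ is relaxed to \s*.
    private static readonly Regex RecordRegex = new Regex(
        @"Name:\s(?<Name>.*?)\s+" +
        @"Position:\s(?<Position>.*?)\s+" +
        @"Bio:\s(?<Bio>.*?)\s+" +
        @"Position History:\s(?<History>.*?)\s+" +
        @"Notes:\s(?<Notes>.*?)\s*" +
        @"(?=^Name:|\z)",
        RegexOptions.Multiline | RegexOptions.Singleline);

    // Returns one dictionary per record, keyed by the group names above.
    public static List<Dictionary<string, string>> Parse(string text)
    {
        var records = new List<Dictionary<string, string>>();
        foreach (Match m in RecordRegex.Matches(text))
        {
            var record = new Dictionary<string, string>();
            foreach (string field in Fields)
            {
                record[field] = m.Groups[field].Value;
            }
            records.Add(record);
        }
        return records;
    }
}
```

Calling `LookaheadParser.Parse(text)` on the sample document should give one dictionary per person, with multi-line values such as Bio and Notes kept intact, including their embedded line breaks.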

Answer 2

Try this one:

(?'tag'[\w\s]+):\s*(?'val'.*([\r\n][^:]*)*)

I just grouped the label preceding the ':' as the named group 'tag', and the (potentially multi-line) text as the value 'val'.

Answer 3

The simplest fix would be to do the match in right-to-left mode:

Regex r = new Regex(@"Name:\s(?:(.*?)\r\n)+?" +
                    @"Position:\s(?:(.*?)\r\n)+?" +
                    @"Bio:\s(?:(.*?)\r\n)+?" +
                    @"Position History:\s(?:(.*?)\r\n)+?" +
                    @"Notes:\s(?:(.*?)\r\n)+?",
                    RegexOptions.Singleline | RegexOptions.RightToLeft);

By the way, I had to delete a bunch of inappropriate question marks to make it work at all. You did want those groups to capture, didn't you?
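
To show how the right-to-left pattern might be consumed, here is a hedged sketch. It keeps the same pattern shape and options as the snippet above and only adds group names, which are an assumption of this sketch along with the class name `RightToLeftParser`. With `RegexOptions.RightToLeft` the engine starts at the end of the input, so the lazy Notes group grows leftward until it reaches its own `Notes:` label, which is why a multi-line value is captured whole. Matches are expected to come back last record first, so the sketch sorts them by `Index` to restore document order. It also assumes the input ends with `\r\n`, since the pattern requires one after the final Notes line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RightToLeftParser
{
    // Same structure as the answer's pattern; the group names are illustrative.
    private static readonly Regex RecordRegex = new Regex(
        @"Name:\s(?:(?<Name>.*?)\r\n)+?" +
        @"Position:\s(?:(?<Position>.*?)\r\n)+?" +
        @"Bio:\s(?:(?<Bio>.*?)\r\n)+?" +
        @"Position History:\s(?:(?<History>.*?)\r\n)+?" +
        @"Notes:\s(?:(?<Notes>.*?)\r\n)+?",
        RegexOptions.Singleline | RegexOptions.RightToLeft);

    // Returns the record matches sorted into document order.
    public static List<Match> Records(string text)
    {
        return RecordRegex.Matches(text)
            .Cast<Match>()
            .OrderBy(m => m.Index)
            .ToList();
    }
}
```

Each returned `Match` exposes the fields through `m.Groups["Name"].Value`, `m.Groups["Notes"].Value`, and so on.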

3 answers

Answer 1
3 2011-01-25 17:08:01

Answer 2
2 2011-01-25 16:54:57

Answer 3
2 accepted 2011-01-25 17:29:59
