
Regular expression question: Until next match OR End Of Document

I'm working on a document parser, written in C#, to extract data from some documents that I've been given. The documents are in the form:


(Type 1): (potentially multi-lined string)
(Type 2): (potentially multi-lined string)
(Type 3): (potentially multi-lined string)
...
(Type N): (potentially multi-lined string)
(Type 1): (potentially multi-lined string)
...
End Of Document.

The document repeats (Type 1)-(Type N) M times in the same format.

I'm having trouble with the multi-lined strings and with finding the last iteration of (Type 1)-(Type N).

What I need to do is capture each (potentially multi-lined string) in a group named by its preceding (Type #).

Here is a snippet of the document that I'm trying to match:

Name: John Dow
Position: VP. over Development
Bio: Here is a really long string of un important stuff
that could include words like "Bio" or "Name".  Some times I have problems
here, but for the most part it should be normal Bio information
Position History: Vp. over Development
Sr. Project Manager
Jr. Project Manager
Developer
Peon
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.
Name: Joe Noob
Position: Peon
Bio: I'm a peon, so I have little bio
Position History: Peon
Notes: few notes
Name: Jane Smith
Position: VP. over Sales
Bio: Here is a really long string of more un important stuff
that could include words like "Bio" or "Name".  Some times I have problems
here, but for the most part it should be normal Bio information
Position History: Vp. over Sales
Sales Manager
Secretary
Notes: Here are some notes that may or may not be multilined
and if it is, all the lines need to be captured for this person.



The order of (Type #) is always the same, and they're always preceded by a newline.

What I have:

Name:\s(?:(?.*?)\r\n)+?Position:\s(?:(?.*?)\r\n)+?Bio:\s(?:(?.*?)\r\n)+?Position History:\s(?:(?.*?)\r\n)+?Notes:\s(?:(?.*?)\r\n)+?



Any help would be great!

Because you're using lazy matching, the last token takes only as much as it must. You can solve that by adding a lookahead at the end of your pattern, so that it matches up to the next token:

(?=^Name:|$)

Here's the full regex:

Name:\s(?:(.*?)\s+)Position:\s(?:(.*?)\s+)Bio:\s(?:(.*?)\s+)Position History:\s(?:(.*?)\s+)Notes:\s(?:(.*?)\s+)(?=^Name:|$)

Example: http://regexhero.net/tester/?id=92982feb-806f-4d0a-96a3-5ef6689a0e01

Try this one:
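
For reference, the lookahead idea might look like this in C#. It is only a sketch under a few assumptions: the group names, the `RecordParser` class and the `ParseRecords` method are made up for illustration, each label is anchored with `^` under `RegexOptions.Multiline`, and `\z` is swapped in for `$`, because in multiline mode `$` also matches before every line break, which could let the lazy Notes group stop too early.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordParser
{
    // Group names are hypothetical; the question does not show the ones it used.
    private static readonly Regex RecordRegex = new Regex(
        @"^Name:\s(?<Name>.*?)\s+" +
        @"^Position:\s(?<Position>.*?)\s+" +
        @"^Bio:\s(?<Bio>.*?)\s+" +
        @"^Position History:\s(?<History>.*?)\s+" +
        // Still lazy, but the lookahead forces the group to run up to
        // the next "Name:" line or to the very end of the input.
        @"^Notes:\s(?<Notes>.*?)\s*(?=^Name:|\z)",
        RegexOptions.Multiline | RegexOptions.Singleline);

    // Returns one dictionary per record, keyed by group name.
    public static List<Dictionary<string, string>> ParseRecords(string document)
    {
        var records = new List<Dictionary<string, string>>();
        foreach (Match m in RecordRegex.Matches(document))
        {
            var record = new Dictionary<string, string>();
            foreach (var name in new[] { "Name", "Position", "Bio", "History", "Notes" })
            {
                record[name] = m.Groups[name].Value;
            }
            records.Add(record);
        }
        return records;
    }
}
```

Calling `RecordParser.ParseRecords(text)` on the snippet from the question should give one dictionary per person, with multi-line values keeping their internal line breaks.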

(?'tag'[\w\s]+):\s*(?'val'.*([\r\n][^:]*)*)

I just grouped the label preceding the ':' as the named group 'tag', and the (potentially) multiline text as the named group 'val'.

The simplest fix would be to do the match in right-to-left mode:
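
Tried against the sample, the pattern above appears to pull the start of the following label into 'val', because `[^:]*` only stops at the next colon. A possible variant of the same tag/value idea is sketched below. It assumes the set of labels is fixed and known in advance, and the `FieldScanner` class and `Scan` method are hypothetical names. A continuation line is accepted only if it does not start with one of the labels.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldScanner
{
    // Assumed, fixed set of labels taken from the sample document.
    private const string Labels = "Name|Position History|Position|Bio|Notes";

    private static readonly Regex FieldRegex = new Regex(
        @"^(?<tag>" + Labels + @"):\s*" +
        // The rest of the label's line, then any following lines
        // that do not themselves start with a label.
        @"(?<val>.*(?:\r?\n(?!(?:" + Labels + @"):).*)*)",
        RegexOptions.Multiline);

    // Returns (tag, value) pairs in document order.
    public static List<KeyValuePair<string, string>> Scan(string document)
    {
        var fields = new List<KeyValuePair<string, string>>();
        foreach (Match m in FieldRegex.Matches(document))
        {
            fields.Add(new KeyValuePair<string, string>(
                m.Groups["tag"].Value,
                m.Groups["val"].Value.Trim())); // drop trailing line breaks
        }
        return fields;
    }
}
```

This yields a flat list of fields rather than one match per person, so the caller would still need to start a new record each time a `Name` tag appears.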

Regex r = new Regex(@"Name:\s(?:(.*?)\r\n)+?" +
                    @"Position:\s(?:(.*?)\r\n)+?" +
                    @"Bio:\s(?:(.*?)\r\n)+?" +
                    @"Position History:\s(?:(.*?)\r\n)+?" +
                    @"Notes:\s(?:(.*?)\r\n)+?",
                    RegexOptions.Singleline | RegexOptions.RightToLeft);

By the way, I had to delete a bunch of inappropriate question marks to make it work at all. You did want those groups to capture, didn't you?
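
To show how the right-to-left version could be consumed, here is a rough sketch. It assumes named groups were what the question intended (the names, the `RightToLeftParser` class and the `Parse` method are invented for illustration), that the document uses `\r\n` line endings, and that it ends with a line break, since the pattern requires `\r\n` after every value. With `RegexOptions.RightToLeft` the matches are found starting from the end of the document, so the list is reversed to restore document order.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RightToLeftParser
{
    // Same shape as the pattern above, with hypothetical group names added.
    private static readonly Regex RecordRegex = new Regex(
        @"Name:\s(?:(?<Name>.*?)\r\n)+?" +
        @"Position:\s(?:(?<Position>.*?)\r\n)+?" +
        @"Bio:\s(?:(?<Bio>.*?)\r\n)+?" +
        @"Position History:\s(?:(?<History>.*?)\r\n)+?" +
        @"Notes:\s(?:(?<Notes>.*?)\r\n)+?",
        RegexOptions.Singleline | RegexOptions.RightToLeft);

    // Each record is returned as { Name, Position, Bio, History, Notes }.
    public static List<string[]> Parse(string document)
    {
        var records = new List<string[]>();
        foreach (Match m in RecordRegex.Matches(document))
        {
            records.Add(new[]
            {
                m.Groups["Name"].Value,
                m.Groups["Position"].Value,
                m.Groups["Bio"].Value,
                m.Groups["History"].Value,
                m.Groups["Notes"].Value
            });
        }
        // The last record in the document is matched first; put them back in order.
        records.Reverse();
        return records;
    }
}
```

If the input might use bare `\n` line endings, the `\r\n` parts of the pattern would need to become `\r?\n` for this to match at all.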
