Regular expression question: Until next match OR End Of Document
I'm working on a document parser to extract data from some documents that I've been given, and I'm coding in C#. The documents are in the form:
(Type 1): (potentially multi-lined string)
(Type 2): (potentially multi-lined string)
(Type 3): (potentially multi-lined string)
...
(Type N): (potentially multi-lined string)
(Type 1): (potentially multi-lined string)
...
End Of Document.
The document repeats (Type 1)-(Type N) M times in the same format.
I'm having trouble with the multi-line strings and with matching the last iteration of (Type 1)-(Type N).
What I need to do is capture each (potentially multi-lined string) in a group named after its preceding (Type #).
Here is a snippet of the document that I'm trying to match:
Name: John Dow
Position: VP. over Development
Bio: Here is a really long string of un important stuff that could include words like "Bio" or "Name". Some times I have problems here, but for the most part it should be normal Bio information
Position History: Vp. over Development Sr. Project Manager Jr. Project Manager Developer Peon
Notes: Here are some notes that may or may not be multilined and if it is, all the lines need to be captured for this person.
Name: Joe Noob
Position: Peon
Bio: I'm a peon, so I have little bio
Position History: Peon
Notes: few notes
Name: Jane Smith
Position: VP. over Sales
Bio: Here is a really long string of more un important stuff that could include words like "Bio" or "Name". Some times I have problems here, but for the most part it should be normal Bio information
Position History: Vp. over Sales Sales Manager Secretary
Notes: Here are some notes that may or may not be multilined and if it is, all the lines need to be captured for this person.
The order of (Type #) is always the same and they're always preceded by a newline.
What I have:
Name:\s(?:(?.*?)\r\n)+?Position:\s(?:(?.*?)\r\n)+?Bio:\s(?:(?.*?)\r\n)+?Position History:\s(?:(?.*?)\r\n)+?Notes:\s(?:(?.*?)\r\n)+?
Any help would be great!
Because you're using lazy matching, the last token takes only as much as it must. You can solve that by adding a lookahead at the end of your pattern, so that it matches up to the next token or the end of the document:
(?=^Name:|$)
Here's the full regex:
Name:\s(?:(.*?)\s+)Position:\s(?:(.*?)\s+)Bio:\s(?:(.*?)\s+)Position History:\s(?:(.*?)\s+)Notes:\s(?:(.*?)\s+)(?=^Name:|$)
Example: http://regexhero.net/tester/?id=92982feb-806f-4d0a-96a3-5ef6689a0e01
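A minimal C# sketch of this lookahead approach, under several assumptions that are not in the answer itself: the class name `PersonRecordParser` and the group names (`Name`, `Position`, `Bio`, `PositionHistory`, `Notes`) are hypothetical, every label is anchored with `^`, `RegexOptions.Singleline` is set so `.` can cross line breaks, `RegexOptions.Multiline` is set so `^` matches at line starts, and `\z` stands in for `$` so the lookahead only succeeds at the true end of the document rather than at any line end.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PersonRecordParser
{
    // Each lazy group stops at the next label; the final lookahead lets
    // Notes run up to the next "Name:" line or the end of the input.
    private static readonly Regex RecordPattern = new Regex(
        @"^Name:\s(?<Name>.*?)\s+" +
        @"^Position:\s(?<Position>.*?)\s+" +
        @"^Bio:\s(?<Bio>.*?)\s+" +
        @"^Position History:\s(?<PositionHistory>.*?)\s+" +
        @"^Notes:\s(?<Notes>.*?)\s*" +
        @"(?=^Name:|\z)",
        RegexOptions.Singleline | RegexOptions.Multiline);

    // Returns one match per record, in document order.
    public static MatchCollection Parse(string document)
    {
        return RecordPattern.Matches(document);
    }
}
```

Each field can then be read with something like `match.Groups["Notes"].Value`. Because the labels are anchored to line starts, a "Name:" or "Bio:" that appears in the middle of a line should not end a field early.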
Try this one:
(?'tag'[\w\s]+):\s*(?'val'.*([\r\n][^:]*)*)
I just grouped the label preceding the ':' as the named group 'tag', and the (potentially) multi-line text as the named group 'val'.
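As posted, `[^:]*` also matches line breaks and label text, so `val` appears able to run on into the start of the next label. A hedged variant of the same tag/value idea is sketched below. It is not the answerer's pattern: the class name `TagValueParser` is hypothetical, and the label list is an assumption taken from the sample document in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagValueParser
{
    // Assumed label set, copied from the sample document in the question.
    private const string Labels = "Name|Position History|Position|Bio|Notes";

    // A pair is a known label at the start of a line plus everything up to
    // the next known label at the start of a line, or the end of the input.
    private static readonly Regex PairPattern = new Regex(
        @"^(?<tag>" + Labels + @"):\s*(?<val>.*?)\s*" +
        @"(?=^(?:" + Labels + @"):|\z)",
        RegexOptions.Singleline | RegexOptions.Multiline);

    // Returns the (tag, value) pairs in document order.
    public static List<KeyValuePair<string, string>> Parse(string document)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in PairPattern.Matches(document))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["tag"].Value, m.Groups["val"].Value));
        }
        return pairs;
    }
}
```

This yields a flat list of pairs rather than one match per person, so a new record would have to be started each time the `Name` tag is seen.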
The simplest fix would be to do the match in right-to-left mode:
Regex r = new Regex(@"Name:\s(?:(.*?)\r\n)+?" +
                    @"Position:\s(?:(.*?)\r\n)+?" +
                    @"Bio:\s(?:(.*?)\r\n)+?" +
                    @"Position History:\s(?:(.*?)\r\n)+?" +
                    @"Notes:\s(?:(.*?)\r\n)+?",
                    RegexOptions.Singleline | RegexOptions.RightToLeft);
By the way, I had to delete a bunch of inappropriate question marks to make it work at all. You did want those groups to capture, didn't you?
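One way the right-to-left regex might be consumed is sketched below, as an illustration only. The class `RightToLeftParser` and the helper `JoinCaptures` are hypothetical names, and the sketch assumes `\r\n` line endings and a document that ends with a line break. A right-to-left search finds the last record first, and a repeated group can hold one capture per line, so the sketch sorts both matches and captures by their `Index` instead of relying on the order in which they were recorded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RightToLeftParser
{
    private static readonly Regex RecordPattern = new Regex(
        @"Name:\s(?:(.*?)\r\n)+?" +
        @"Position:\s(?:(.*?)\r\n)+?" +
        @"Bio:\s(?:(.*?)\r\n)+?" +
        @"Position History:\s(?:(.*?)\r\n)+?" +
        @"Notes:\s(?:(.*?)\r\n)+?",
        RegexOptions.Singleline | RegexOptions.RightToLeft);

    // Rebuilds one field from all captures of a group, sorted by position
    // so the result does not depend on the order captures were recorded in.
    private static string JoinCaptures(Group group)
    {
        return string.Join("\n",
            group.Captures.Cast<Capture>()
                 .OrderBy(c => c.Index)
                 .Select(c => c.Value));
    }

    // Returns one string[5] per record, in document order:
    // Name, Position, Bio, Position History, Notes.
    public static List<string[]> Parse(string document)
    {
        return RecordPattern.Matches(document)
            .Cast<Match>()
            .OrderBy(m => m.Index)
            .Select(m => Enumerable.Range(1, 5)
                .Select(i => JoinCaptures(m.Groups[i]))
                .ToArray())
            .ToList();
    }
}
```

The groups are unnamed here, as in the answer, so fields are addressed by group number 1 to 5.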