Regex Extract Match Groups

(50798.3 vol 1 of 14-page 113)

The above is my clipboard content. As in my previous question, I extracted 50798.3 vol 1 of 14 and saved it in pdf_name, and 113 as pagenumber. This worked well.

            var current_clipboard = Clipboard.GetText();
            var regEx = @"^\((?<Desc>[^-]*)-page\s(?<Page>\d+)";
            var match = Regex.Match(current_clipboard, regEx);
            string pdf_name = match.Groups["Desc"].Value;
            string pagenumber = match.Groups["Page"].Value;

Now, for a variation of the clipboard content where -page has the format _Page or _Pages, I used the code below and it is not working. By not working, I mean that when I use MessageBox.Show for pdf_name and pagenumber, the message box displays blank messages. Also, when I test the regex, it shows 3 groups, as here. I don't need the second match group.

            var current_clipboard = Clipboard.GetText();
            var regEx = @"^\((?<Desc>[^-]*)_pag(e|es)\s(?<Page>\d+)";
            var match = Regex.Match(current_clipboard, regEx);
            string pdf_name = match.Groups["Desc"].Value;
            string pagenumber = match.Groups["Page"].Value;

So, I am doing something wrong. Please help me save the correct values to pdf_name and pagenumber.

Edit

@Jerry

I tried your version as below.

            var current_clipboard = Clipboard.GetText();
            var regEx = @"^\((?<Desc>[^-]*)_pages?\s(?<Page>\d+)";
            var match = Regex.Match(current_clipboard, regEx);
            string pdf_name = match.Groups["Desc"].Value;
            string pagenumber = match.Groups["Page"].Value;
            MessageBox.Show(pdf_name);
            MessageBox.Show(pagenumber);

Unfortunately, the message box is returning blank messages.

The issue seems to be that, given you no longer have a - separator before your pages, your initial [^-]* pattern is gobbling up your whole string.

If underscores don't appear in your description, you should replace [^-]* with [^_]*. Alternatively, use lazy matching: (?<Desc>.*?).
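As a rough illustration of the two suggested patterns, here is a minimal sketch written as C# top-level statements. The input string is a hypothetical example of the `_Page` variant (the question does not show one), and `RegexOptions.IgnoreCase` is an added assumption so that `_Page` can match `_page`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical clipboard text for the "_Page" variant (not taken from the question).
string input = "(50798.3 vol 1 of 14_Page 113)";

// Option 1: the description may contain anything except an underscore.
string negatedPattern = @"^\((?<Desc>[^_]*)_pages?\s(?<Page>\d+)";
// Option 2: lazy matching, which expands only as far as needed to reach "_page".
string lazyPattern = @"^\((?<Desc>.*?)_pages?\s(?<Page>\d+)";

// IgnoreCase is an assumption made here so that "_Page" matches "_page".
Match negatedMatch = Regex.Match(input, negatedPattern, RegexOptions.IgnoreCase);
Match lazyMatch = Regex.Match(input, lazyPattern, RegexOptions.IgnoreCase);

string negatedDesc = negatedMatch.Groups["Desc"].Value;
string negatedPage = negatedMatch.Groups["Page"].Value;
string lazyDesc = lazyMatch.Groups["Desc"].Value;
string lazyPage = lazyMatch.Groups["Page"].Value;

Console.WriteLine(negatedDesc + " | " + negatedPage);
Console.WriteLine(lazyDesc + " | " + lazyPage);
```

For this particular input, both patterns should capture the same description and page number; they would differ only if the description itself contained an underscore.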

You're capturing (e|es) as the 2nd group.

Change it to a non-capturing group:

(?:e|es)

Non-Capturing Groups: (?: Often, you need parentheses in order to write an expression that makes sense. Normally, parentheses capture what they match. Non-capturing groups allow you to use parentheses without capturing anything. Watch out, as the syntax closely resembles that for a lookahead.

Non-Capturing Group Pattern: (?:Bob) Matches Bob, but Bob is not captured.

http://www.rexegg.com/regex-disambiguation.html
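To make the difference concrete, the following sketch compares the groups that .NET reports for the capturing and non-capturing versions of the pattern. The input string is a hypothetical example, and `[^_]*` is used for the description purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical clipboard text (not taken from the question).
string input = "(50798.3 vol 1 of 14_pages 113)";

// (e|es) is an unnamed capturing group, so it gets its own numbered group.
var capturing = new Regex(@"^\((?<Desc>[^_]*)_pag(e|es)\s(?<Page>\d+)");
// (?:e|es) groups the alternation without capturing anything.
var nonCapturing = new Regex(@"^\((?<Desc>[^_]*)_pag(?:e|es)\s(?<Page>\d+)");

// GetGroupNames always includes group "0", which is the whole match.
string[] capturingNames = capturing.GetGroupNames();
string[] nonCapturingNames = nonCapturing.GetGroupNames();

Console.WriteLine("Capturing:     " + string.Join(", ", capturingNames));
Console.WriteLine("Non-capturing: " + string.Join(", ", nonCapturingNames));

// The named groups hold the same values either way.
string descWith = capturing.Match(input).Groups["Desc"].Value;
string descWithout = nonCapturing.Match(input).Groups["Desc"].Value;
Console.WriteLine(descWith == descWithout);
```

The extra numbered group does not affect the values of Desc and Page; it only adds an entry that a regex tester will display.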

Regexes are case sensitive by default, meaning that p will match only p and not P. If you want a case-insensitive regex, you can use RegexOptions.IgnoreCase or the inline modifier (?i), or you can use [Pp] in your regex, which will match either p or P (but the other letters will still be matched in a case-sensitive manner).

With the option, the line with .Match will change:

var match = Regex.Match(current_clipboard, regEx, RegexOptions.IgnoreCase);

With the inline modifier, the regex will change:

var regEx = @"(?i)^\((?<Desc>[^-]*)_pag(e|es)\s(?<Page>\d+)";

With the character class, the regex will change:

var regEx = @"^\((?<Desc>[^-]*)_[Pp]ag(e|es)\s(?<Page>\d+)";
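The three approaches can be compared side by side with a small sketch like the one below. The input string is a hypothetical `_Pages` example rather than text from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical clipboard text with a capital "P" (not taken from the question).
string input = "(50798.3 vol 1 of 14_Pages 113)";

string basePattern = @"^\((?<Desc>[^-]*)_pag(e|es)\s(?<Page>\d+)";
string classPattern = @"^\((?<Desc>[^-]*)_[Pp]ag(e|es)\s(?<Page>\d+)";

// Case-sensitive by default: "_pag" cannot match "_Pag".
bool plain = Regex.IsMatch(input, basePattern);
// 1. RegexOptions.IgnoreCase passed to the call.
bool withOption = Regex.IsMatch(input, basePattern, RegexOptions.IgnoreCase);
// 2. Inline modifier (?i) at the start of the pattern.
bool withInline = Regex.IsMatch(input, "(?i)" + basePattern);
// 3. Character class [Pp] for just the one letter.
bool withClass = Regex.IsMatch(input, classPattern);

Console.WriteLine($"plain: {plain}, option: {withOption}, inline: {withInline}, class: {withClass}");
```

The first two make the whole pattern case-insensitive, while the character class relaxes only the single letter it replaces.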

To your next issue, the basic way to avoid a capture is to use a non-capturing group. Here, you have (e|es), which is a capturing group. Change that to (?:e|es):

var regEx = @"^\((?<Desc>[^-]*)_[Pp]ag(?:e|es)\s(?<Page>\d+)";

Though really, you don't need an alternation here. You can use the ? quantifier, meaning 0 or 1 times:

var regEx = @"^\((?<Desc>[^-]*)_[Pp]ages?\s(?<Page>\d+)";

Example with the option and the ? quantifier:

var current_clipboard = Clipboard.GetText();
var regEx = @"^\((?<Desc>[^-]*)_pages?\s(?<Page>\d+)";
var match = Regex.Match(current_clipboard, regEx, RegexOptions.IgnoreCase);
string pdf_name = match.Groups["Desc"].Value;
string pagenumber = match.Groups["Page"].Value;
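When the pattern does not match, `match.Groups["Desc"].Value` is simply an empty string, which is why the message boxes appear blank. A minimal sketch of guarding against that with `match.Success` follows; `TryExtract` is a hypothetical helper name and the input strings are assumed examples.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: reports whether the pattern matched instead of
// silently handing back empty strings.
static bool TryExtract(string text, out string pdfName, out string pageNumber)
{
    var match = Regex.Match(
        text,
        @"^\((?<Desc>[^-]*)_pages?\s(?<Page>\d+)",
        RegexOptions.IgnoreCase);

    // On a failed match these are empty strings, not null.
    pdfName = match.Groups["Desc"].Value;
    pageNumber = match.Groups["Page"].Value;
    return match.Success;
}

// Assumed sample inputs for the "_Page" and "_Pages" variants.
bool okPage = TryExtract("(50798.3 vol 1 of 14_Page 113)", out string name1, out string page1);
bool okPages = TryExtract("(50798.3 vol 1 of 14_Pages 113)", out string name2, out string page2);
// The original "-page" format is not covered by this pattern.
bool okHyphen = TryExtract("(50798.3 vol 1 of 14-page 113)", out string name3, out string page3);

Console.WriteLine($"{okPage}: {name1} / {page1}");
Console.WriteLine($"{okPages}: {name2} / {page2}");
Console.WriteLine($"{okHyphen}: '{name3}' / '{page3}'");
```

Checking the return value before showing the results makes a failed match visible rather than leaving it to look like an empty capture.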

And here's another regex tester site that supports this syntax for named capture groups.
