regex c# optional group - should act greedy?

I have a regex roughly like this:

blablabla.+?(?:<a href="(http://.+?)" target="_blank">)?

I want to capture a URL if I find one. The regex finds stuff, but I don't get the link (the capture is always empty). Now if I remove the question mark at the end, like this:

blablabla.+?(?:<a href="(http://.+?)" target="_blank">)

This will only match stuff that has the link at the end... it's 2:40 am... and I've got no ideas...

--Edit--

sample input:

blablabla asd 1234t535 <a href="http://google.com" target="_blank">

expected output:

match 0:

    group 1: <a href="http://google.com" target="_blank">
    group 2: http://google.com

I just want "http://google.com" or ""

Are you doing a whole-string match? If so, try adding .* to the end of the first regex and see what it matches. The problem with the first regex is that, because of the .+?, it can match anything after blablabla (leading to an empty capture), but the parenthesized part still won't match an a tag unless it's at the end of the string. By the way, looking at your expected output, capture 1 will be the URL; the parentheses around the whole HTML tag are non-capturing because of the ?: at the beginning.
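To see the empty capture for yourself, here is a minimal sketch. The class and method names (OptionalGroupDemo, CaptureUrl) are made up for illustration, and the ^...$ anchored variant is my assumption about what a "whole-string match" would look like; it is not taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionalGroupDemo
{
    // Returns capture group 1, or "" when the group did not take part in the match.
    public static string CaptureUrl(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : "";
    }

    static void Main()
    {
        const string input = "blablabla asd 1234t535 <a href=\"http://google.com\" target=\"_blank\">";
        const string tag = "<a href=\"(http://.+?)\" target=\"_blank\">";

        // Pattern from the question: lazy .+? followed by an optional group.
        string optional = "blablabla.+?(?:" + tag + ")?";
        // Assumed whole-string variant: the anchors force .+? to keep expanding.
        string anchored = "^blablabla.+?(?:" + tag + ")?$";

        // .+? takes one character, the optional group fails there and is skipped.
        Console.WriteLine("[" + CaptureUrl(optional, input) + "]");           // prints []
        // With anchors, the match can only finish once the tag (or the end) is reached.
        Console.WriteLine("[" + CaptureUrl(anchored, input) + "]");           // prints [http://google.com]
        Console.WriteLine("[" + CaptureUrl(anchored, "blablabla asd") + "]"); // prints []
    }
}
```

The unanchored pattern is satisfied as soon as .+? has consumed a single character, so the regex engine never has a reason to look further for the tag.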

You shouldn't need .+? at the start; the regex is going to search the whole input anyway.

You also have the closing '>' right after "_blank", which will limit your matches:

(?:<a href="(http://.+?)" target="_blank".*?>)

regex test
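A quick sketch of how that pattern could be used from C#. The names LinkSearchDemo and FindUrl are hypothetical, and the rel="nofollow" attribute is an invented test input to show what the added .*? buys you.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkSearchDemo
{
    // Pattern from this answer: no leading prefix, and .*? allows extra attributes before '>'.
    static readonly Regex Link =
        new Regex("(?:<a href=\"(http://.+?)\" target=\"_blank\".*?>)");

    // Returns the URL from the first matching tag, or "" when there is no such tag.
    public static string FindUrl(string input)
    {
        Match m = Link.Match(input);
        return m.Success ? m.Groups[1].Value : "";
    }

    static void Main()
    {
        // Regex.Match scans forward by itself, so no leading .+? is needed.
        Console.WriteLine(FindUrl("blablabla asd 1234t535 <a href=\"http://google.com\" target=\"_blank\">"));
        // An extra attribute after target is absorbed by .*?
        Console.WriteLine(FindUrl("blablabla <a href=\"http://google.com\" target=\"_blank\" rel=\"nofollow\">"));
        // No link at all: the match fails and "" is returned, so this prints 0.
        Console.WriteLine(FindUrl("blablabla asd 1234t535").Length);
    }
}
```

Checking m.Success in code gives the "URL or empty string" result the question asks for, without making the group optional inside the pattern.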

It's the trailing ? that's doing you in. Reason: By marking it as optional, you're allowing the .+? to grab it.

blablabla.*(?:<a href="((http://)?.*)".+target="_blank".*>)

I modified it slightly... .+? does much the same job here as .* (strictly, .+? is lazy and needs at least one character, while .* is greedy and may match nothing), and if you may have nothing in your href (you indicated you wanted ""), you need to make the http optional as well as the trailing text. Also, the .+ in front of target means you have at least one space or character, but may have more (multiple blanks or other attributes). The .* before the > means you can have blanks or other attributes trailing after.

This will not match a line at all if there's no <a href...>, but that's what you want, right?

The (?: ... ) wrapper can be dropped completely: it is a non-capturing group, so removing it does not change what gets captured.

This will fail if the attributes are not listed in the order specified... which is one of the reasons regex can't really be used to parse html. But if you're certain the href will always come before the target, this should do what you need.
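Here is a small sketch of how the modified pattern behaves on a few inputs. The names GreedyPatternDemo and TryGetUrl are hypothetical, and every input except the question's sample line is invented for illustration. Note that in this pattern the URL is group 1 and the optional http:// prefix is group 2.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyPatternDemo
{
    const string Pattern =
        "blablabla.*(?:<a href=\"((http://)?.*)\".+target=\"_blank\".*>)";

    // Returns false when the line does not match at all;
    // otherwise url receives group 1, which may legitimately be "".
    public static bool TryGetUrl(string line, out string url)
    {
        Match m = Regex.Match(line, Pattern);
        url = m.Success ? m.Groups[1].Value : "";
        return m.Success;
    }

    static void Main()
    {
        string url;

        // Sample input from the question.
        Console.WriteLine(TryGetUrl("blablabla asd 1234t535 <a href=\"http://google.com\" target=\"_blank\">", out url) + " [" + url + "]"); // True [http://google.com]

        // Empty href: the line still matches, with an empty capture.
        Console.WriteLine(TryGetUrl("blablabla <a href=\"\" target=\"_blank\">", out url) + " [" + url + "]"); // True []

        // No tag: the whole line fails to match.
        Console.WriteLine(TryGetUrl("blablabla asd 1234t535", out url) + " [" + url + "]"); // False []

        // Attributes in the other order: no match either.
        Console.WriteLine(TryGetUrl("blablabla <a target=\"_blank\" href=\"http://google.com\">", out url) + " [" + url + "]"); // False []
    }
}
```

Returning the success flag separately keeps "matched with an empty href" distinguishable from "no link on this line", which a plain empty string would blur.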
