Why is the conditional statement in my regex pattern getting rid of other matches when it shouldn't change anything?

Here is my text (it will be looking through other text as well, but this is what I am having trouble with):

<a href="/wiki/Basketball" title="Basketball">basketball</a>, the 

<li class="interwiki-cs"><a href="//cs.wikipedia.org/wiki/" title="" lang="cs" hreflang="cs">esky</a><

<li class="interwiki-da"><a href="//da.wikipedia.org/wiki/" title="" lang="da" hreflang="da"><b>Dansk</b></a></li>

I'm trying to get 3 matches, each with 2 groups (shown here separated by a semicolon):

/wiki/Basketball;basketball
//cs.wikipedia.org/wiki/;esky
//da.wikipedia.org/wiki/;Dansk

With this pattern: (?<=<a href=")(.*?)".*?>([\w\s\./,0-9]*?)< , I can match the first two correctly. To try to also get the last match, I added a conditional to check for the <b> : (?<=<a href=")(.*?)".*?>(<?)(?(2)b>)([\w\s\./,0-9]*?)< . This gets the last match correctly, but now the first two don't match.

Can you please explain why this happens and what the correct way to do this is?

To be honest, I have trouble understanding conditionals myself. I asked a question about them, but didn't get an answer.
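For what it's worth, one likely explanation: in Python's re module, (?(2)...) tests whether group 2 took part in the match, not whether it captured any text. In (<?) the ? sits inside the parentheses, so the group always takes part (possibly capturing an empty string) and b> ends up being required every time. The sketch below compares the question's pattern with two alternatives: moving the ? outside the group, and replacing the conditional with an optional non-capturing (?:<b>)?. It assumes the three sample lines are joined with newlines; the names text, HEAD, TAIL, always_set, maybe_set and no_conditional are illustrative and not from the original post.

```python
import re

# Sample text from the question, assumed here to be joined with newlines.
text = (
    '<a href="/wiki/Basketball" title="Basketball">basketball</a>, the \n'
    '<li class="interwiki-cs"><a href="//cs.wikipedia.org/wiki/" title="" '
    'lang="cs" hreflang="cs">esky</a><\n'
    '<li class="interwiki-da"><a href="//da.wikipedia.org/wiki/" title="" '
    'lang="da" hreflang="da"><b>Dansk</b></a></li>'
)

# Shared start and end of the question's patterns.
HEAD = r'(?<=<a href=")(.*?)".*?>'
TAIL = r'([\w\s\./,0-9]*?)<'

# Question's version: group 2 always participates, so "b>" is always required.
always_set = re.compile(HEAD + r'(<?)(?(2)b>)' + TAIL)

# "?" moved outside the group: group 2 is unset when there is no "<".
maybe_set = re.compile(HEAD + r'(<)?(?(2)b>)' + TAIL)

# No conditional at all: an optional non-capturing "<b>".
no_conditional = re.compile(HEAD + r'(?:<b>)?' + TAIL)

print(always_set.findall(text))      # only the Dansk link matches
print(maybe_set.findall(text))       # three matches, with group 2 as an extra field
print(no_conditional.findall(text))  # three (href, label) pairs
```

The last variant keeps the result at exactly two groups per match, which is closest to the output the question asks for.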

I took advantage of a negated character class ([^...]) and did this:

re.findall('(?<=<a href=")(.*?)".*>([^>]+)<',string)

or

re.findall('(?<=<a href=")(.*?)".*(?<=>)([^>]+)(?=<)',string)

In both cases the second group matches a non-empty string that follows '>', does not contain '>', and precedes '<'. It should match the last non-empty string between tags.

By adding '?' to the '.*' after the first group (making it lazy), the second group should match the first non-empty string between tags:

re.findall('(?<=<a href=")(.*?)".*?(?<=>)([^>]+)(?=<)',string)
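As a quick check, here is a minimal sketch that runs the lazy pattern against the sample text from the question. It assumes the three lines are joined with newlines; the names text, lazy_pattern and pairs are illustrative.

```python
import re

# Sample text from the question, assumed here to be joined with newlines.
text = (
    '<a href="/wiki/Basketball" title="Basketball">basketball</a>, the \n'
    '<li class="interwiki-cs"><a href="//cs.wikipedia.org/wiki/" title="" '
    'lang="cs" hreflang="cs">esky</a><\n'
    '<li class="interwiki-da"><a href="//da.wikipedia.org/wiki/" title="" '
    'lang="da" hreflang="da"><b>Dansk</b></a></li>'
)

# Lazy ".*?" stops at the first position that is preceded by ">" and from
# which a non-empty run of non-">" characters is followed by "<".
lazy_pattern = r'(?<=<a href=")(.*?)".*?(?<=>)([^>]+)(?=<)'

pairs = re.findall(lazy_pattern, text)
for href, label in pairs:
    print(f'{href};{label}')
# /wiki/Basketball;basketball
# //cs.wikipedia.org/wiki/;esky
# //da.wikipedia.org/wiki/;Dansk
```

On the third line the position right after hreflang="da"> is skipped because no non-empty run starting there is followed by '<', so the search moves on to the text after <b>.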

Separately, the following snippet should catch all non-empty strings between tags:

re.findall('(?<=>)([^>]+)(?=<)',string)
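One caveat, shown as a sketch rather than a definitive fix: [^>] still allows '<' and newlines, so on multi-line input the snippet above can return fragments such as '<\n'. The variant below, which is not part of the original answer, also excludes '<' and strips whitespace. It assumes the sample lines are joined with newlines; text, original, stricter and cleaned are illustrative names.

```python
import re

# Sample text from the question, assumed here to be joined with newlines.
text = (
    '<a href="/wiki/Basketball" title="Basketball">basketball</a>, the \n'
    '<li class="interwiki-cs"><a href="//cs.wikipedia.org/wiki/" title="" '
    'lang="cs" hreflang="cs">esky</a><\n'
    '<li class="interwiki-da"><a href="//da.wikipedia.org/wiki/" title="" '
    'lang="da" hreflang="da"><b>Dansk</b></a></li>'
)

# Pattern as given in the answer: anything except ">" between ">" and "<".
original = re.findall(r'(?<=>)([^>]+)(?=<)', text)

# Assumed variant: also exclude "<" so a stray "<" cannot be captured.
stricter = re.findall(r'(?<=>)([^<>]+)(?=<)', text)

# Trim surrounding whitespace and drop anything left empty.
cleaned = [s.strip() for s in stricter if s.strip()]

print(original)
print(cleaned)
```

For anything beyond small, well-behaved snippets like this one, an HTML parser is usually more robust than a regular expression.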

I hope I'm not wrong, but if I am, please tell me.
