

Delphi XE2 Regex: Quantifier does not work inside positive lookbehind?

I have a complete HTML document string from a web page containing this BASE tag:

<BASE href="http://whatreallyhappened.com/">

In Delphi XE2, I use this regular expression, with the whole HTML document as the subject, to get the URL between the double quotes of the BASE tag:

BaseURL := TRegEx.Match(HTMLDocStr, '(?<=<base(\s)href=").*(?=")', [roIgnoreCase]).Value;

This works, but only if there is exactly ONE whitespace character in the subject between BASE and href.

I tried to add a quantifier to the whitespace part of the regex, (\s), but it did not work.

So how can I make this regex match the URL even if there are several spaces between BASE and href?

You're making this far too complicated by using lookaround. If you want to extract only part of the regex match, simply add a capturing group. Then you can use the text matched by the capturing group instead of the overall match. In most cases you'll also get much better performance this way.

To find the base tag in a file and extract its URL, you can use the regex <base[^>]+href=["']([^"']*)["']. Call TRegEx.Match() to get a TMatch. This has a Groups property that you can use to retrieve group 1 if a match was found.
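As a rough illustration of the capturing-group approach, here is a minimal console sketch. The program name and the sample HTML string are assumptions made for the example. Only TRegEx.Match, TMatch.Success and TMatch.Groups come from the description above.

```delphi
program ExtractBaseHref;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  // Hypothetical sample input with several spaces between BASE and href.
  HTMLDocStr =
    '<html><head><BASE   href="http://whatreallyhappened.com/"></head></html>';

var
  M: TMatch;
begin
  // Single quotes inside a Delphi string literal are doubled, so ["''] is
  // the character class ["'] in the actual regex.
  M := TRegEx.Match(HTMLDocStr, '<base[^>]+href=["'']([^"'']*)["'']',
    [roIgnoreCase]);

  // Check Success before reading the group, since there may be no BASE tag.
  if M.Success then
    Writeln(M.Groups[1].Value)  // prints http://whatreallyhappened.com/
  else
    Writeln('No BASE tag found');
end.
```

Because only group 1 is read, the number of whitespace characters before href no longer matters and no lookbehind is needed.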

With lookaround

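You can try quantifiers in different ways, such as the patterns below. Both move the quantifier outside the lookbehind, so the overall match also includes the whitespace and the href=" prefix: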

(?<=<BASE)\s+href=".*(?=")
(?<=<BASE)\s{0,30}href=".*(?=")

Working demo


Without lookaround

By the way, if you just want the content of href, there is no need for lookaround; you can simply use:

<BASE\s+href="(.*?)"
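The lazy quantifier in this pattern matters when the tag carries further quoted attributes. The sketch below compares it with a greedy variant; the sample tag with a target attribute is a made-up input, and the greedy pattern is included only for contrast.

```delphi
program GreedyVsLazy;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  // Hypothetical BASE tag with a second quoted attribute after href.
  Sample = '<BASE  href="http://whatreallyhappened.com/" target="_blank">';

begin
  // Lazy (.*?) stops at the first closing quote after the URL.
  // prints http://whatreallyhappened.com/
  Writeln(TRegEx.Match(Sample, '<BASE\s+href="(.*?)"',
    [roIgnoreCase]).Groups[1].Value);

  // Greedy (.*) runs on to the last quote on the line.
  // prints http://whatreallyhappened.com/" target="_blank
  Writeln(TRegEx.Match(Sample, '<BASE\s+href="(.*)"',
    [roIgnoreCase]).Groups[1].Value);
end.
```

Both patterns are known to match this sample, so Groups[1] is read directly. With real input, check the match's Success property first.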

Working demo


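EDIT: after reading your comments I figured out a workaround (ugly, but it could work). PCRE, which Delphi's TRegEx wraps, requires a lookbehind to have a fixed length, so you can spell out one fixed-length lookbehind per allowed number of whitespace characters. You can try something like this: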

((?<=<BASE\shref=")|(?<=<BASE\s\shref=")|(?<=<BASE\s\s\shref=")).*(?=")
          ^---notice \s        ^---notice \s\s       ^---notice \s\s\s

I know that this is horrible, but if none of the above works you can try that.
