
preg_match to extract mailto on anchor

I need to extract the email address from an anchor whose href is a mailto: link, using a regular expression.

This pattern: (.*)<a\s(.*?)(.*)\s*href\=['"]mailto:([-a-z0-9_]+)@([a-z0-9-]+).([a-z]+)['"]>(.*)</a>(.*)

It works in Regex Coach, but it doesn't work in PHP.

Code:

preg_match("'(.*)<a (.*?)(.*) *href\=['\"]mailto:([-a-z0-9_]+)@([a-z0-9-]+).([a-z]+)['\"]>(.*)</a>(.*)'si", "<a href=\"mailto:someemail@ohio.com\">Some email</a>", $matches);

print_r($matches);

So why doesn't it work in PHP?

PHP's PCRE functions require the regular expression to be wrapped in delimiters that separate the pattern from the optional modifiers. The first character of the pattern string is taken as the delimiter (here ' ), so the pattern ends at the next ' and is actually just (.*)<a (.*?)(.*) *href\=[ with everything after it treated as modifiers. That is an invalid regular expression, because the [ opens a character class that is never closed, and the remaining characters are not valid modifiers either.
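To make the delimiter problem concrete, here is a minimal sketch that splits the question's pattern string the way described above and then runs it. The variable names ($effectivePattern, $modifiers) are illustrative, and the strpos() scan is a simplification that ignores backslash escapes, which is enough for this particular pattern.

```php
<?php
// The pattern string from the question, kept verbatim via nowdoc
// so that no PHP string escaping gets in the way.
$pattern = <<<'REGEX'
'(.*)<a (.*?)(.*) *href\=['"]mailto:([-a-z0-9_]+)@([a-z0-9-]+).([a-z]+)['"]>(.*)</a>(.*)'si
REGEX;
$subject = '<a href="mailto:someemail@ohio.com">Some email</a>';

// The delimiter is the first character of the pattern string.
$delimiter = $pattern[0];

// The pattern ends at the next occurrence of the delimiter;
// whatever follows is read as modifiers.
$end = strpos($pattern, $delimiter, 1);
$effectivePattern = substr($pattern, 1, $end - 1);
$modifiers = substr($pattern, $end + 1);

echo "delimiter: $delimiter\n";
echo "pattern:   $effectivePattern\n";
echo "modifiers: $modifiers\n";

// @ silences the warning so that the return value can be inspected.
$result = @preg_match($pattern, $subject, $matches);
var_dump($result); // false signals that the pattern was rejected
```

The printed pattern is the truncated (.*)<a (.*?)(.*) *href\=[ and the "modifiers" line holds the rest of the intended regex, which is why preg_match() rejects the call instead of returning a match.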

As the others have already suggested, you can fix this either by escaping every occurrence of the delimiter ' inside the regular expression or by choosing a different delimiter that does not appear in it.

Besides that, parsing HTML with regular expressions is very error-prone. In your case, using that many .* groups will also perform badly, because the regex engine has to backtrack through the many ways of dividing the input among them.

It is better to use a proper HTML parser that returns a DOM you can query, such as PHP's DOM library:

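// $str holds the HTML to search.
$doc = new DOMDocument();
$doc->loadHTML($str);
foreach ($doc->getElementsByTagName("a") as $a) {
    if ($a->hasAttribute("href")) {
        $href = trim($a->getAttribute("href"));
        // URL schemes are case-insensitive, so compare in lower case.
        if (strtolower(substr($href, 0, 7)) === 'mailto:') {
            // parse_url() splits the URL; the address should be in $components['path'].
            $components = parse_url($href);
        }
    }
}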
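As a self-contained variant of the DOM approach, the following sketch wraps the loop in a function and returns the addresses it finds. The function name extractMailtoAddresses and the sample HTML are made up for illustration. It strips the scheme with substr() instead of parse_url(), and it assumes that anything after a ? (such as ?subject=...) should be dropped.

```php
<?php
// Hypothetical helper: collect the addresses of all mailto: links in an HTML string.
function extractMailtoAddresses(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $addresses = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        // getAttribute() returns an empty string when href is missing.
        $href = trim($a->getAttribute('href'));
        if (stripos($href, 'mailto:') !== 0) {
            continue; // not a mailto: link
        }
        // Drop the scheme and any "?subject=..." style parameters.
        $address = substr($href, strlen('mailto:'));
        $address = explode('?', $address, 2)[0];
        if ($address !== '') {
            $addresses[] = rawurldecode($address);
        }
    }
    return $addresses;
}

$html = '<p><a href="mailto:someemail@ohio.com">Some email</a> <a href="/about">About</a></p>';
print_r(extractMailtoAddresses($html));
```

Because the parser handles attribute quoting, attribute order and letter case, none of those details have to be encoded in a pattern.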

Your delimiter is a single quote ' , and it also occurs inside the regex:

preg_match("'(.*)<a (.*?)(.*) *href\=['\"]mailto:([-a-z0-9_]+)@([a-z0-9-]+).([a-z]+)['\"]>(.*)</a>(.*)'si", "<a href=\"mailto:someemail@ohio.com\">Some email</a>", $matches);
                                      ^                                              ^

Escape them (i.e. \' ) or change your delimiter.
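Both fixes can be sketched side by side. The patterns below are the question's regex with only the delimiter handling changed, plus one extra correction that is an assumption on my part: the . between the domain and the TLD is escaped as \. so that it matches a literal dot rather than any character. The variable names are illustrative.

```php
<?php
$subject = '<a href="mailto:someemail@ohio.com">Some email</a>';

// Option 1: keep ' as the delimiter and escape each ' inside the pattern.
// Nowdoc keeps the regex verbatim, so \' reaches PCRE unchanged.
$escaped = <<<'REGEX'
'(.*)<a (.*?)(.*) *href\=[\'"]mailto:([-a-z0-9_]+)@([a-z0-9-]+)\.([a-z]+)[\'"]>(.*)</a>(.*)'si
REGEX;

// Option 2: switch to a delimiter that never occurs in the pattern, here ~.
$tilde = <<<'REGEX'
~(.*)<a (.*?)(.*) *href\=['"]mailto:([-a-z0-9_]+)@([a-z0-9-]+)\.([a-z]+)['"]>(.*)</a>(.*)~si
REGEX;

foreach ([$escaped, $tilde] as $pattern) {
    if (preg_match($pattern, $subject, $matches) === 1) {
        // Groups 4, 5 and 6 hold the local part, the domain name and the TLD.
        echo $matches[4], '@', $matches[5], '.', $matches[6], "\n";
    }
}
```

Changing the delimiter is usually the easier option to read, since the pattern itself stays free of extra backslashes.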

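// $subject holds the HTML to search.
if (preg_match('#<a\s.*?href=[\'"]mailto:([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6})[\'"].*?>.*?</a>#i', $subject, $regs)) {
    // Group 1 is the address; $regs[0] would be the whole anchor element.
    $result = $regs[1];
} else {
    $result = "";
}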


 