
Matching a Specific URL Pattern with PHP

I'm trying to read an HTML file and capture all anchor tags that match a specific URL pattern in order to display those links on another page. The pattern looks like this:

https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web

I'm lousy with RegEx. I've tried a bunch of things and read a bunch of answers here on Stack Overflow, but I'm not hitting on the correct syntax.

Here's what I have now:

preg_match ('/<a href="https:\/\/docs.google.com\/file\/d\/(.*)<\/a>/', $file, $matches)

When I test this on an HTML page with two matching anchor tags, the first result includes the first and second match and everything in between, while the second result includes part of the first match, part of the second match, and everything in between.

While I'd be happy to capture matching anchor tags along with the inner HTML, I'd be even happier if I could generate a multidimensional array with the HREF attribute of each matching anchor tag, along with the matching inner HTML (so I can format the links myself, without having to use even more RegEx to get rid of unwanted attributes). Would I use preg_match_all for that? What would that look like?

Am I even on the right path here, or should I be using DOM and XPath queries to find this stuff?

Thanks.

Oh jeez, I can't believe every answer here uses "/" delimiters. If your pattern has slashes in it, use something else for the sake of readability.

Here's a better answer (you may need to tweak it if your anchors can have attributes other than href):

$hrefPattern = "(?P<href>https://docs\.google\.com/file/d/[a-z0-9]+/edit\?usp=drive_web)";
$innerPattern = "(?P<inner>.*?)";
$anchorPattern = "<a href=\"$hrefPattern\">$innerPattern</a>";
preg_match_all("@$anchorPattern@i", $file, $matches);

This will give you something like:

[
    0 => ['<a href="https://docs.google.com/file/d/foo/edit?usp=drive_web"><span>More foo</span></a>'],
    "href" => ["https://docs.google.com/file/d/foo/edit?usp=drive_web"],
    "inner" => ["<span>More foo</span>"]
]

And absolutely, you should use the DOM for this.
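Since none of the answers shows the DOM route, here is a minimal sketch of it using PHP's bundled DOMDocument and DOMXPath classes. The sample HTML in $file and the $links array layout are assumptions made for illustration; in the question, $file would hold the contents of the HTML file being read.

```php
<?php
// Hypothetical sample input standing in for the HTML file from the question.
$file = <<<'HTML'
<html><body>
<a class="doc" href="https://docs.google.com/file/d/abc123/edit?usp=drive_web"><span>First doc</span></a>
<a href="https://example.com/other">Unrelated</a>
<a href="https://docs.google.com/file/d/XyZ789/edit?usp=drive_web" target="_blank">Second doc</a>
</body></html>
HTML;

$doc = new DOMDocument();
// Real-world HTML is often not well formed, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($file);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select every anchor whose href begins with the Google Docs file prefix.
$nodes = $xpath->query('//a[starts-with(@href, "https://docs.google.com/file/d/")]');

$links = [];
foreach ($nodes as $a) {
    // Rebuild the inner HTML by serializing each child node of the anchor.
    $inner = '';
    foreach ($a->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $links[] = ['href' => $a->getAttribute('href'), 'inner' => $inner];
}

print_r($links);
```

Because the parser handles attribute order and extra attributes, this version does not care whether href comes first in the tag.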

Replace (.*) with (.*?) to use lazy quantification:

preg_match('/<a href="https:\/\/docs.google.com\/file\/d\/(.*?)<\/a>/', $file, $matches);
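To see why the lazy quantifier matters, the sketch below runs both variants with preg_match_all against a made-up single-line input containing two links. The input string and the ~ delimiter are assumptions chosen for the example, not part of the original code.

```php
<?php
// Hypothetical two-link input, mirroring the test described in the question.
$file = '<a href="https://docs.google.com/file/d/abc/edit">One</a> text '
      . '<a href="https://docs.google.com/file/d/def/edit">Two</a>';

// Greedy: (.*) runs to the LAST </a> in the input, so both links collapse into one match.
preg_match_all('~<a href="https://docs\.google\.com/file/d/(.*)</a>~', $file, $greedy);

// Lazy: (.*?) stops at the FIRST </a> it can reach, so each link is matched separately.
preg_match_all('~<a href="https://docs\.google\.com/file/d/(.*?)</a>~', $file, $lazy);

echo count($greedy[0]), "\n"; // 1
echo count($lazy[0]), "\n";   // 2
print_r($lazy[1]);            // the captured text of each link
```

The greedy version reproduces the symptom from the question: one result that spans the first link, the second link, and everything in between.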

Dave,

The DOM would be better. But here is the Regex that works.

$url = 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"';

preg_match ('/href="https:\/\/docs.google.com\/file\/d\/(.*?)"/', $url, $matches);

Results:

array (size=2)
    0 => string 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"' (length=82)
    1 => string 'aBunchOfLettersAndNumbers/edit?usp=drive_web' (length=44)

You can add the HTML tags back in, but most importantly, the pattern in your question's preg_match line didn't include the closing > of the opening tag, and it needed (.*?) instead of (.*). The added ? makes the quantifier lazy, so it matches as few characters as possible before the rest of the pattern can match. (.*) on its own is greedy: it matches as many characters as it can, which is why your result ran through to the last closing tag.

You could use the following regular expression:

/<a.*?href="(https:\/\/docs\.google\.com\/file\/d\/.*?)".*?>(.*?)<\/a>/

Which would give you the URL from the href and the innerHTML.

Breakdown

<a.*?href=" Matches the opening a tag and any characters up to href="

(https:\/\/docs\.google\.com\/file\/d\/.*?)" Matches (and captures) up to the end of the href (i.e. up to the closing ")

.*?> Matches all characters up to the > that ends the opening a tag

(.*?)<\/a> Matches (and captures) the innerHTML up to the closing a tag (i.e. </a>).
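As a rough sketch of how that pattern could produce the multidimensional array the question asks for, the example below passes it to preg_match_all with the PREG_SET_ORDER flag. The sample input, the ~ delimiter, and the $links layout are assumptions for illustration only.

```php
<?php
// Hypothetical single-line input with extra attributes before and after href.
$file = '<p><a class="x" href="https://docs.google.com/file/d/abc123/edit?usp=drive_web" target="_blank"><b>First</b></a></p>'
      . '<p><a href="https://example.com/">Other</a></p>'
      . '<p><a href="https://docs.google.com/file/d/XyZ789/edit?usp=drive_web">Second</a></p>';

// Same pattern as above, with ~ as the delimiter so the slashes need no escaping.
$pattern = '~<a.*?href="(https://docs\.google\.com/file/d/.*?)".*?>(.*?)</a>~';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $file, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // $m[1] is the captured href, $m[2] the captured inner HTML.
    $links[] = ['href' => $m[1], 'inner' => $m[2]];
}

print_r($links);
```

One caveat: .*? can cross tag boundaries, so the full match in $m[0] may start at an earlier, unrelated anchor even though the two captures are correct. Replacing .*? with [^>]*? inside the opening tag keeps the match within a single tag. Also note that . does not match newlines unless the s modifier is added.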
