PHP: regular expression to search for a pattern in a file and select it

Question

I am really confused by regular expressions in PHP.

Anyway, I can't read a whole tutorial right now, because I have a bunch of HTML files in which I have to find links ASAP. I came up with the idea of automating it with PHP code, since that is the language I know.

So I think I can use this script:

$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????"; 
if(preg_match_all("/$regexp/siU", $input, $matches)) { 
    // $matches[2] = array of link addresses
    // $matches[3] = array of link text - including HTML code
}

My problem is with $regexp.

The pattern I need looks like this:

href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF

I want to search for lines like the one above, which occur many times in the files, and extract /content/r807215r37l86637/fulltext.pdf from each.

Any help?

==================

Edit

The title attributes are important to me: all of the links I want are titled

title="Download PDF"

Answer 1

Once again, regexps are bad for parsing HTML.

Save your sanity and use the built-in DOM libraries.

$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$data = array();
foreach($x->query("//a[@title='Download PDF']") as $node)
{
    $data[] = $node->getAttribute("href");
}

Edit: Updated code based on ircmaxell's comment.
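For readers who want to try this answer end to end, here is a minimal self-contained sketch. The function name `extractPdfLinks` and the sample markup are assumptions made for illustration. It also swaps the `@` suppression for `libxml_use_internal_errors()`, which is one common way to keep parse warnings for sloppy HTML off the screen.

```php
<?php
// Hypothetical helper (the name is an assumption): returns the href of every
// <a> element whose title attribute is exactly "Download PDF".
function extractPdfLinks(string $html): array
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = array();
    foreach ($xpath->query("//a[@title='Download PDF']") as $node) {
        $links[] = $node->getAttribute('href');
    }
    return $links;
}

// Made-up sample input based on the link shown in the question.
$sample = '<p><a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF</a>'
        . '<a href="/about.html" title="About">About</a></p>';
print_r(extractPdfLinks($sample));
```

Only the first link in the sample carries the wanted title, so only its href should be returned.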

Answer 2

Try something like this. If it does not work, show some examples of the links you want to parse.

<?php
$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#'; 

if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) { 
  foreach ($matches as $match) {
    printf("Url: %s<br/>", $match[1]);
  }
}

Edit: updated so it searches for "Download PDF" entries only.

Answer 3

That's easier with phpQuery or QueryPath:

foreach (qp($html)->find("a") as $a) { 
    if ($a->attr("title") == "Download PDF") {
        print $a->attr("href");
        print $a->innerHTML();
    }
}

With regexps it depends on some consistency of the source:

preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);

Looking for a fixed title="..." attribute is doable, but more difficult, as it depends on the attribute's position before the closing bracket.
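The dependence on attribute order can be relaxed with a lookahead, so that title may appear before or after href. The following is only a sketch under stated assumptions: the function name `pdfLinksAnyOrder` and the sample markup are made up, attribute values are assumed to be double-quoted, and no attribute value is assumed to contain `>`.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the answers above).
function pdfLinksAnyOrder(string $html): array
{
    // (?=...) checks that title="Download PDF" occurs somewhere inside the
    // same opening tag; href is then captured wherever it sits in that tag.
    $pattern = '#<a\b(?=[^>]*\stitle="Download PDF")[^>]*\shref="([^"]*)"#';
    preg_match_all($pattern, $html, $m);
    return $m[1];
}

// Made-up sample: href first, title first, and a link that should be skipped.
$sample = '<a href="/a.pdf" title="Download PDF">A</a>'
        . '<a title="Download PDF" class="x" href="/b.pdf">B</a>'
        . '<a href="/c.html" title="Home">C</a>';
print_r(pdfLinksAnyOrder($sample));
```

This still inherits the usual fragility of matching HTML with regexps; single-quoted or unquoted attributes would need a different pattern.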

Answer 4

The best way is to use DomXPath to do the search in one step:

$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);

$links = array();
foreach($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
    $links[] = $node->getAttribute("href");
}

Or even:

$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach($xpath->evaluate($query) as $attr) {
    $links[] = $attr->value;
}
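Since the question mentions a bunch of HTML files, the same XPath query could be applied to each file in turn. This is a rough sketch, not a definitive implementation: the function name `collectPdfLinks` and the `pages/*.html` location are assumptions, and unreadable or empty files are simply skipped.

```php
<?php
// Hypothetical helper (the name is an assumption): maps each readable file
// path to the list of hrefs whose link title contains "Download PDF".
function collectPdfLinks(array $paths): array
{
    // Keep libxml warnings about malformed HTML from being printed.
    $previous = libxml_use_internal_errors(true);
    $links = array();

    foreach ($paths as $path) {
        if (!is_readable($path)) {
            continue; // skip files that cannot be opened
        }
        $html = file_get_contents($path);
        if ($html === false || trim($html) === '') {
            continue; // nothing to parse
        }

        $dom = new DOMDocument();
        $dom->loadHTML($html);
        $xpath = new DOMXPath($dom);

        // Select the href attribute nodes directly, as in the second variant above.
        foreach ($xpath->query('//a[contains(@title, "Download PDF")]/@href') as $attr) {
            $links[$path][] = $attr->value;
        }
        libxml_clear_errors();
    }

    libxml_use_internal_errors($previous);
    return $links;
}

// "pages/*.html" is a placeholder location, not taken from the question.
$files = glob('pages/*.html') ?: array();
print_r(collectPdfLinks($files));
```

Files without any matching link do not appear in the result, which keeps the output limited to pages that actually contain a download link.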

Answer 5

href="([^"]+)" will give you all links of that form.
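Note that a pattern along the lines of `href="([^"]+)"` matches every double-quoted href, not only the links titled "Download PDF", so the result would still need filtering. A small sketch to illustrate, where the function name `allHrefs` and the sample markup are assumptions:

```php
<?php
// Hypothetical helper (the name is an assumption): returns every
// double-quoted href value found in the input, regardless of title.
function allHrefs(string $html): array
{
    preg_match_all('/href="([^"]+)"/', $html, $m);
    return $m[1];
}

// Made-up sample with one wanted link and one unrelated link.
$sample = '<a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF</a>'
        . '<a href="/help.html" title="Help">Help</a>';
print_r(allHrefs($sample));
```

Both hrefs come back here, including the unrelated help link, which is why the other answers also test the title attribute.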
