
Extract the data from content of HTML

I'm trying to extract data from HTML. I fetched the page with curl, but all I need is to pass the title to another variable:

<meta  property="og:url" content="https://example.com/">

How can I extract this, and is there a better way?

You should use a parser to pull values out of HTML files, strings, or documents. Here's an example using DOMDocument.

$string = '<meta  property="og:url" content="https://example.com/">';
$doc = new DOMDocument();
$doc->loadHTML($string);
$metas = $doc->getElementsByTagName('meta');
foreach($metas as $meta) {
    if($meta->getAttribute('property') == 'og:url') {
        echo $meta->getAttribute('content');
    }
}

Output:

https://example.com/
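
Since the question asks about passing the value to another variable rather than echoing it, the loop above could be wrapped in a small helper that returns the value. This is only a minimal sketch: the function name `getMetaContent` is hypothetical, and `$html` stands in for whatever response body curl returned.

```php
<?php
// Hypothetical helper: returns the content attribute of the first
// <meta property="..."> tag that matches, or null if none is found.
function getMetaContent(string $html, string $property): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === $property) {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Assumption: $html holds the page source, e.g. the string returned by curl.
$html = '<html><head><meta property="og:url" content="https://example.com/"></head></html>';
$url = getMetaContent($html, 'og:url');
echo $url;
```

Returning `null` when no tag matches lets the caller tell a missing tag apart from an empty `content` attribute.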

If you are loading the HTML from a remote location rather than a local string, you can still use DOM, with something like:

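// Suppress the warnings libxml raises on malformed real-world HTML.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('https://evernote.com');
libxml_clear_errors();
$xp = new DOMXPath($dom);
$nodes = $xp->query('//meta[@property="og:url"]');
$meta = $nodes->item(0);
// item(0) is null when no matching tag exists, so check before using it.
if ($meta instanceof DOMElement) {
    // Read the content attribute directly instead of looping over every attribute.
    print $meta->getAttribute('content');
}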

This outputs the expected value:

https://evernote.com/
