
CURLOPT_RETURNTRANSFER returns HTML as a string

I am trying to parse HTML fetched with cURL using DOMDocument or XPath, but CURLOPT_RETURNTRANSFER always returns the URL's HTML as a string, which makes it invalid HTML for parsing.

Returned output:

string(102736) "<!DOCTYPE html>


    <html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">

    <head>

        <title>html - PHP outputting text WITHOUT echo/print? - Stack Overflow</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
        <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
        <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">"

PHP snippet used to view the output:

$cc = $http->get($url);
var_dump($cc);

cURL library used: https://github.com/seikan/HTTP/blob/master/class.HTTP.php

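When I remove CURLOPT_RETURNTRANSFER, I see the HTML without the string(102736) prefix, but the page is echoed even though I did not ask for it (reference: "curl_exec printing results when I don't want to").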

This is the PHP code I use to parse the HTML:

  $cc = $http->get($url);
  $doc = new \DOMDocument();
  $doc->loadHTML($cc);

  // all links in document
  $links = [];
  $arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
  foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
      'href' => $href,
      'text' => $text
    ];
  }

Any ideas?

Check the return value:

print_r($cc);

You will probably find that the output is an array (if the code ran successfully). In the library source, get() returns...

return [
    'header' => $headers,
    'body'   => substr($response, $size),
];

So you need to change the line that loads the HTML to...

$doc->loadHTML($cc['body']);
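
If the same parsing code has to cope with a helper that returns either a raw string or an array like the one above, the return value can be normalized before it reaches loadHTML(). The following is only a minimal sketch: extractBody() is a hypothetical helper, not part of the library, and the sample array merely imitates the 'header'/'body' shape shown in the library source.

```php
<?php
// Hypothetical helper (not part of the seikan/HTTP library): accepts either
// a raw HTML string or an array with a 'body' key and returns the HTML string.
function extractBody($result): string
{
    if (is_string($result)) {
        return $result;
    }
    if (is_array($result) && isset($result['body']) && is_string($result['body'])) {
        return $result['body'];
    }
    throw new InvalidArgumentException('Unexpected response type: ' . gettype($result));
}

// Stand-in for the value returned by $http->get($url); the contents are made up.
$cc = [
    'header' => ['Content-Type' => 'text/html; charset=utf-8'],
    'body'   => '<!DOCTYPE html><html><body><a href="/a">A</a></body></html>',
];

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$doc->loadHTML(extractBody($cc));
libxml_clear_errors();            // discard the collected warnings

// Number of <a> elements found in the sample body
echo $doc->getElementsByTagName('a')->length, "\n";
```

Throwing on an unexpected type makes a wrong return shape fail loudly at the call site instead of surfacing later as a confusing loadHTML() warning.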

Update:

As an example of the above, using this question as the page to work with...

$cc = $http->get("https://stackoverflow.com/questions/51319473/curlopt-returntransfer-returns-html-in-string/51319585?noredirect=1#comment89619183_51319585");
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($cc['body']);

// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}

print_r($links);

Output...

Array
(
    [0] => Array
        (
            [href] => #
            [text] => 
        )

    [1] => Array
        (
            [href] => https://stackoverflow.com
            [text] => Stack Overflow
        )

    [2] => Array
        (
            [href] => #
            [text] => 
        )

    [3] => Array
        (
            [href] => https://stackexchange.com/users/?tab=inbox
...
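
Since the question also mentions XPath, the same links can be collected with DOMXPath instead of getElementsByTagName(). This is a minimal sketch under stated assumptions: the inline $html string stands in for $cc['body'], the query //a[@href] (anchors that carry an href attribute) is just one illustrative choice, and the whitespace regex is /\s+/ rather than the /[\r\n]+/ used above so that indentation inside the link text collapses too.

```php
<?php
// Stand-in for $cc['body']; this markup is made up for illustration.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
  <a href="https://stackoverflow.com">Stack
     Overflow</a>
  <a name="no-href">skipped</a>
  <a href="#"></a>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Select only <a> elements that have an href attribute
foreach ($xpath->query('//a[@href]') as $item) {
    $links[] = [
        'href' => $item->getAttribute('href'),
        // Collapse any run of whitespace (including newlines) to one space
        'text' => trim(preg_replace('/\s+/', ' ', $item->nodeValue)),
    ];
}

print_r($links);
```

With this sample markup the anchor without an href is skipped, so $links ends up with two entries: the Stack Overflow link and the empty "#" link.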

