Why am I not getting back any images here?

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$html = @file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xml = @simplexml_import_dom($doc);
$images = $xml->xpath('//img');

var_dump($images);
die();

Output is:

array(0) { }

However, in the page source I see this:

<img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />

Edit: It appears $html's contents stop at the <body> tag for this page. Any idea why?

"It appears $html's contents stop at the <body> tag for this page. Any idea why?"

Yes, you must send this page a valid user agent.

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_exec($ch);

outputs everything up to the closing </html>, including your requested <img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />

A plain wget or curl request without a user agent, by contrast, returns the page only up to the <body> tag.

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');

var_dump($images);
die();

EDIT: My first post stated that there was still an issue with XPath. I simply had not done my due diligence; the updated code above works. I had forgotten to force curl to return the response as a string rather than print it to the screen (its default behavior), which is what CURLOPT_RETURNTRANSFER does.

Why bring SimpleXML into the mix? You're already loading the HTML from w3fools into the DOM class, which has a perfectly good XPath query engine of its own.

[...snip...]
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// DOMXPath exposes query(), not xpath(); it returns a DOMNodeList
$images = $xpath->query('//img');
[...snip...]
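To make the DOMXPath approach concrete, here is a minimal, self-contained sketch. The inline $html string is a hypothetical stand-in for the downloaded page, and the second image is invented purely for illustration; only DOMDocument, DOMXPath::query() and libxml_use_internal_errors() come from PHP itself.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<html><body>'
      . '<img src="/images/w3schoolslogo.gif" alt="W3Schools.com" />'
      . '<p>No image here</p>'
      . '<img src="/images/other.gif" alt="Other" />'
      . '</body></html>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Query the DOM directly; no SimpleXML conversion is needed.
$xpath  = new DOMXPath($doc);
$images = $xpath->query('//img'); // DOMNodeList of DOMElement nodes

// Pull the src attribute out of each matched element.
$sources = [];
foreach ($images as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

Because query() returns a DOMNodeList rather than an array, the loop reads attributes with getAttribute() instead of SimpleXML's array-style access.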

The IMG tag is generated by JavaScript. If you'd downloaded this page via wget, you'd realize there is no IMG tag in the HTML.

Update #1

I believe it is because of the user agent string. If I supply "Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0" as the user agent, I get the whole page.
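If the user agent is indeed the cause, the original file_get_contents() call could probably be kept by passing it a stream context that sets the header. The sketch below is an assumption-laden illustration, not a tested fix for this site: the helper names build_user_agent_context() and fetch_html() and the 10-second timeout are hypothetical, and whether the server still behaves as described is not something the code can guarantee.

```php
<?php
// Hypothetical helper: build a stream context that sends a User-Agent header.
function build_user_agent_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // seconds; an arbitrary choice for this sketch
        ],
    ]);
}

// Hypothetical helper: fetch a URL with the given user agent.
// Returns the response body as a string, or false on failure.
function fetch_html(string $url, string $userAgent)
{
    return file_get_contents($url, false, build_user_agent_context($userAgent));
}
```

A call such as fetch_html($url, 'MozillaXYZ/1.0') would then stand in for the bare file_get_contents($url) in the question, with the result checked against false before it is handed to loadHTML().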
