Why am I not getting any images back here?

Question

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$html = @file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xml = @simplexml_import_dom($doc);
$images = $xml->xpath('//img');

var_dump($images);
die();

Output is:

array(0) { }

However, in the page source I see this:

<img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />

Edit: It appears that $html's contents stop at the <body> tag for this page. Any idea why?

Answer 1

It appears that $html's contents stop at the <body> tag for this page. Any idea why?

Yes, you must send this page a valid user agent.

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_exec($ch);

This outputs everything up to the closing </html> tag, including the image you asked about:

<img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />

By contrast, a plain wget or curl request without a user agent returns the page only up to the <body> tag. Putting it together, with CURLOPT_RETURNTRANSFER set so that curl returns the response as a string:

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');

var_dump($images);
die();

EDIT: My first post stated that there was still an issue with xpath. I was just not doing my due diligence, and the updated code above works great. I had forgotten to force curl to return its output as a string rather than print it to the screen (as it does by default).
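Since the question used file_get_contents() rather than curl, the same user-agent idea can be sketched with a stream context. This is only a minimal sketch: buildUserAgentContext() is a hypothetical helper name, the timeout value is an arbitrary assumption, and whether this particular site responds the same way to file_get_contents() is not confirmed by the thread. The network call is left commented out so the snippet runs offline.

```php
<?php
// Hypothetical helper (not from the original answer): build a stream context
// that makes file_get_contents() send a User-Agent header.
function buildUserAgentContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // seconds; arbitrary value for this sketch
        ],
    ]);
}

$context = buildUserAgentContext('MozillaXYZ/1.0');

// The actual request would look like this (commented out to stay offline):
// $html = file_get_contents('http://www.w3schools.com/js/js_loop_for.asp', false, $context);

// Inspect the context to confirm which user agent would be sent.
$options = stream_context_get_options($context);
echo $options['http']['user_agent'], "\n";
```

The resulting $html string could then be passed to DOMDocument::loadHTML() exactly as in the curl version above.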

Answer 2

Why bring simplexml into the mix? You're already loading the HTML from w3fools into the DOM class, which has a perfectly good XPath query engine built in.

[...snip...]
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$images = $xpath->query('//img'); // DOMXPath exposes query(), not xpath()
[...snip...]
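As a self-contained illustration of the DOMXPath approach, here is a minimal sketch that runs against an inline HTML string instead of the downloaded page. The sample markup is an assumption standing in for the real response, and the second image (/images/other.gif) is purely hypothetical.

```php
<?php
// Inline sample markup; an assumption standing in for the downloaded page.
$html = '<html><body>'
      . '<img src="/images/w3schoolslogo.gif" alt="W3Schools.com" />'
      . '<p>No image here</p>'
      . '<img src="/images/other.gif" alt="Other" />'
      . '</body></html>';

// Collect parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Query the DOM directly; no SimpleXML conversion is needed.
$xpath = new DOMXPath($doc);
$sources = [];
foreach ($xpath->query('//img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

DOMXPath::query() returns a DOMNodeList of DOMElement objects, so attributes are read with getAttribute() rather than SimpleXML's array-style access.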

Answer 3

The IMG tag is generated by JavaScript. If you download this page via wget, you'll see that there is no IMG tag in the HTML.

Update #1

I believe it is because of the user agent string. If I supply "Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0" as the user agent, I get the whole page.

3 answers

Answer 1: score 9, accepted, 2011-04-26 21:07:29
Answer 2: score 0, 2011-04-19 17:12:25
Answer 3: score -1, 2011-04-26 14:44:01