
How to download HTML using PHP?

How do I download an HTML file from a URL in PHP, download all of its dependencies like CSS and images, and store them on my server as files? Am I asking for too much?

The easiest way to do this would be to use wget. It can recursively download HTML and its dependencies. Otherwise you will be parsing the HTML yourself. See Yacoby's answer for details on doing it in pure PHP.
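Since the rest of the examples here are PHP, one way to sketch this is to drive wget from PHP via shell_exec(); the output directory name below is arbitrary, and the flags shown (-p for page requisites such as CSS and images, -k to rewrite links to the local copies) are standard wget options:

```php
<?php
// Build a wget command that mirrors a single page plus its requisites:
// -p fetches page requisites (images, CSS), -k rewrites links to the
// local copies, -P sets the output directory (an arbitrary name here).
function build_wget_command($url, $dir) {
    return sprintf('wget -p -k -P %s %s',
                   escapeshellarg($dir), escapeshellarg($url));
}

$cmd = build_wget_command('http://www.example.com/', 'downloaded_site');
// shell_exec($cmd); // uncomment to actually run the download
```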

I would recommend using an HTML parsing library to simplify everything, namely something like Simple HTML DOM.

Using Simple HTML DOM:

// Requires the Simple HTML DOM library (simple_html_dom.php)
require_once 'simple_html_dom.php';
$html = file_get_html('http://www.google.com/');
foreach ($html->find('img') as $element) {
    // $element->src may be a relative URL and need resolving before download
    $src = $element->src;
    file_put_contents(basename($src), file_get_contents($src));
}

For downloading files (and HTML) I would recommend using an HTTP wrapper such as cURL, as it allows far more control than file_get_contents. However, if you want to use file_get_contents, there are some good examples on the PHP site of how to fetch URLs.

The more complex method allows you to specify the headers, which could be useful if you want to set the User-Agent. (If you are scraping other sites a lot, it is good to have a custom User-Agent, as you can use it to give the website admins your site or a point of contact if you are using too much bandwidth, which is better than the admin blocking your IP address.)

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n"
  )
);

$context = stream_context_create($opts);
$file = file_get_contents('http://www.example.com/', false, $context);
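For instance, a custom User-Agent can be set through the same stream context; the identity string and contact URL below are placeholders, not real values:

```php
<?php
// Placeholder identity string; replace with your own project and contact info
$opts = array(
  'http' => array(
    'method' => 'GET',
    'header' => "Accept-language: en\r\n",
    'user_agent' => 'MySiteMirror/1.0 (+http://www.example.com/contact)'
  )
);

$context = stream_context_create($opts);
// $file = file_get_contents('http://www.example.com/', false, $context);
```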

Although of course it can be done simply by:

$file = file_get_contents('http://www.example.com/');

The library you want to look at is cURL with PHP. cURL performs actions pertaining to HTTP requests (and other networking protocols, but I'd bet HTTP is the most used). You can set HTTP cookies, along with GET/POST variables.

I'm not sure whether it will automatically download the dependencies - you might have to download the HTML, parse out the IMG/LINK tags, and then use cURL again to fetch those dependencies.
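That parsing step can be sketched with PHP's built-in DOMDocument (the helper name extract_dependency_urls is made up for this example); each URL it returns could then be fetched with cURL and written to disk:

```php
<?php
// Collect the URLs of a page's dependencies from its <img> and <link> tags.
function extract_dependency_urls($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ suppresses warnings from real-world HTML
    $urls = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $urls[] = $img->getAttribute('src');
    }
    foreach ($doc->getElementsByTagName('link') as $link) {
        $urls[] = $link->getAttribute('href');
    }
    return $urls;
}

$html = '<html><head><link rel="stylesheet" href="style.css"></head>'
      . '<body><img src="logo.png"></body></html>';
$deps = extract_dependency_urls($html);
// $deps now holds 'logo.png' and 'style.css'; fetch each and save it
```

Note that the URLs may be relative and would need resolving against the page URL before fetching.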

There are a bazillion tutorials out there on how to do this. Here's a simple example (scroll to the bottom) of a basic HTTP GET request from the people who make libcurl (upon which PHP's cURL bindings are based):

<?php
// A very simple example that gets an HTTP page.
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://www.zend.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);

// The response body is echoed directly because CURLOPT_RETURNTRANSFER is not set
curl_exec($ch);

curl_close($ch);
?>

You might take a look at the cURL wrappers for PHP: http://us.php.net/manual/en/book.curl.php

As far as dependencies go, you could probably get a lot of them using some regular expressions that look for things like <script src="...">, but a proper (X)HTML parser would let you traverse the DOM more meaningfully.
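A regex pass of that sort might look like this quick-and-dirty sketch; it only handles double-quoted src attributes with no other attributes before src, which is exactly why a real parser is more robust:

```php
<?php
// Naive extraction of script sources; breaks on single quotes, extra
// attributes before src, unusual whitespace, etc.
$html = '<script src="app.js"></script><script src="lib.js"></script>';
preg_match_all('/<script\s+src="([^"]+)"/i', $html, $matches);
// $matches[1] holds array('app.js', 'lib.js')
```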

Perl's Mechanize does this very well. There is a library that does a similar task as Mechanize, but for PHP, in the answer to this question:

Is there a PHP equivalent of Perl's WWW::Mechanize?

I think most of the options are covered in SO questions about PHP and screen scraping.

For example, how to implement a web scraper in php, or how do i implement a screen scraper in php.

I realise you want more than just a screen scraper, but I think these questions will answer yours.

Screen scraping is probably your best answer.

What you would probably want to do is use SimpleXML to parse the HTML, and when you hit an <img> or <script> tag, read the src attribute and download that file.
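One caveat: SimpleXML expects well-formed XML, so real-world HTML is usually loaded through DOMDocument first and then imported. A minimal sketch of that approach (the sample markup is made up):

```php
<?php
$html = '<html><body><img src="a.png"><script src="b.js"></script></body></html>';

// DOMDocument tolerates messy HTML; simplexml_import_dom converts the tree
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

$srcs = array();
foreach ($xml->xpath('//img[@src] | //script[@src]') as $node) {
    $srcs[] = (string) $node['src'];
}
// $srcs now contains 'a.png' and 'b.js'; each could be downloaded with
// file_get_contents() or cURL and written to disk
```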
