

Download millions of images from external website

I am working on a real estate website and we're about to get an external feed of ~1M listings. Assuming each listing has ~10 photos associated with it, that's about ~10M photos, and we're required to download each of them to our server so as to not "hot link" to them.

I'm at a complete loss as to how to do this efficiently. I ran some numbers and concluded that, at a rate of 0.5 seconds per image, downloading ~10M images from an external server one at a time could take upwards of ~58 days (10,000,000 × 0.5 s ≈ 5,000,000 s ≈ 58 days). Which is obviously unacceptable.

Each photo seems to be roughly ~50KB, but that can vary, with some being larger (sometimes much larger) and some being smaller. At ~50KB apiece, 10M photos also works out to roughly 500GB of data in total.

I've been testing by simply using:

copy('http://www.external-site.com/image1.jpg', '/path/to/folder/image1.jpg');

I've also tried cURL, wget, and others.

I know other sites do it, and at a much larger scale, but I haven't the slightest clue how they manage this sort of thing without it taking months at a time.

Pseudo-code based on the XML feed we're set to receive. We're parsing the XML using PHP (a minimal parsing sketch follows the sample below):

<listing>
    <listing_id>12345</listing_id>
    <listing_photos>
        <photo>http://example.com/photo1.jpg</photo>
        <photo>http://example.com/photo2.jpg</photo>
        <photo>http://example.com/photo3.jpg</photo>
        <photo>http://example.com/photo4.jpg</photo>
        <photo>http://example.com/photo5.jpg</photo>
        <photo>http://example.com/photo6.jpg</photo>
        <photo>http://example.com/photo7.jpg</photo>
        <photo>http://example.com/photo8.jpg</photo>
        <photo>http://example.com/photo9.jpg</photo>
        <photo>http://example.com/photo10.jpg</photo>
    </listing_photos>
</listing>
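
For illustration, here is a minimal sketch of pulling the listing IDs and photo URLs out of the feed with SimpleXML. It assumes the full feed wraps many <listing> elements in a root element and has been saved to a local file; the listings.xml filename and the root element are assumptions, not part of the question.

<?php
// Minimal SimpleXML sketch. 'listings.xml' is a placeholder for wherever the
// feed is stored; a root element wrapping many <listing> nodes is assumed.
$feed = simplexml_load_file('listings.xml');

foreach ($feed->listing as $listing) {
    $listingId = (string) $listing->listing_id;

    foreach ($listing->listing_photos->photo as $photo) {
        $url = trim((string) $photo);
        // Queue ($listingId, $url) for download here rather than downloading
        // inline, so the actual fetching can be parallelized later.
        echo $listingId . ' => ' . $url . PHP_EOL;
    }
}
?>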

So my script will iterate through each photo for a specific listing and download the photo to our server, and also insert the photo name into our photo database (the insert part is already done without issue).

Any thoughts?

Before you do this

Like @BrokenBinar said in the comments, take into account how many requests per second the host can handle. You don't want to flood them with requests without them knowing. Then use something like sleep to limit your requests to whatever rate they can support.
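
As a rough illustration, here is a minimal sequential rate-limiting sketch. $photoUrls, $targetDir and $maxPerSecond are hypothetical placeholders, and the limit itself should come from whatever the feed provider agrees to.

<?php
// Hypothetical rate-limited sequential downloader. Placeholders:
// $photoUrls (URLs parsed from the feed), $targetDir, $maxPerSecond.
$photoUrls    = array(/* ... photo URLs from the feed ... */);
$targetDir    = '/path/to/folder';
$maxPerSecond = 5;
$delayMicros  = (int) (1000000 / $maxPerSecond);

foreach ($photoUrls as $url) {
    $dest = $targetDir . '/' . basename(parse_url($url, PHP_URL_PATH));

    if (!file_exists($dest)) {      // skip photos we already have
        copy($url, $dest);          // needs allow_url_fopen; cURL works too
        usleep($delayMicros);       // stay under the agreed request rate
    }
}
?>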

Curl Multi

Anyway, use Curl. Somewhat of a duplicate answer but copied anyway:

$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

// Create one easy handle per URL and attach it to the multi handle
for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

// Run all transfers in parallel, waiting for activity instead of spinning
do {
    curl_multi_exec($master, $running);
    curl_multi_select($master);
} while ($running > 0);

// Collect the response bodies once everything has finished
$results = array();
for ($i = 0; $i < $node_count; $i++) {
    $results[] = curl_multi_getcontent($curl_arr[$i]);
    curl_multi_remove_handle($master, $curl_arr[$i]);
}
curl_multi_close($master);

print_r($results);

From: PHP Parallel curl requests
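
For the image case specifically, here is a hedged adaptation of the same curl_multi pattern that streams each response straight to a file via CURLOPT_FILE instead of buffering it in memory. $photoUrls and $targetDir are placeholders, not names from the question.

<?php
// Sketch: download one batch of image URLs in parallel with curl_multi,
// writing each response directly to disk. $photoUrls/$targetDir are placeholders.
$photoUrls = array(/* ... a batch of photo URLs from the feed ... */);
$targetDir = '/path/to/folder';

$master  = curl_multi_init();
$handles = array();

foreach ($photoUrls as $i => $url) {
    $dest = $targetDir . '/' . basename(parse_url($url, PHP_URL_PATH));
    $fp   = fopen($dest, 'w');

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);             // stream the body to the file
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // don't hang on a dead server

    curl_multi_add_handle($master, $ch);
    $handles[$i] = array($ch, $fp);
}

do {
    curl_multi_exec($master, $running);
    curl_multi_select($master);
} while ($running > 0);

foreach ($handles as $pair) {
    list($ch, $fp) = $pair;
    curl_multi_remove_handle($master, $ch);
    curl_close($ch);
    fclose($fp);
}
curl_multi_close($master);
?>

With ~10M URLs you would feed this in batches of, say, 20-50 handles at a time and combine it with the sleep-based throttling above, rather than adding every URL to one multi handle.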

Another solution:

Pthread

<?php

class WebRequest extends Stackable {
    public $request_url;
    public $response_body;

    public function __construct($request_url) {
        $this->request_url = $request_url;
    }

    public function run(){
        $this->response_body = file_get_contents(
            $this->request_url);
    }
}

class WebWorker extends Worker {
    public function run(){}
}

$list = array(
    new WebRequest("http://google.com"),
    new WebRequest("http://www.php.net")
);

$max = 8;
$threads = array();
$start = microtime(true);

/* start some workers */
for ($thread = 1; $thread <= $max; $thread++) {
    $threads[$thread] = new WebWorker();
    $threads[$thread]->start();
}

/* stack the jobs onto workers */
foreach ($list as $job) {
    $threads[array_rand($threads)]->stack(
        $job);
}

/* wait for completion */
foreach ($threads as $thread) {
    $thread->shutdown();
}

$time = microtime(true) - $start;

/* tell you all about it */
printf("Fetched %d responses in %.3f seconds\n", count($list), $time);
$length = 0;
foreach ($list as $listed) {
    $length += strlen($listed["response_body"]);
}
printf("Total of %d bytes\n", $length);
?>

Source: PHP testing between pthreads and curl

You should really use the search feature, ya know :)

You can save all the links into a database table (it will be your "job queue"). Then you can create a script which, in a loop, gets a job and does it (fetches the image for a single link and marks the job record as done). You can execute the script multiple times, e.g. using supervisord, so the job queue will be processed in parallel. If it's too slow, you can just start another worker script (if bandwidth doesn't slow you down first).
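
Here is a minimal sketch of such a worker, assuming a hypothetical photo_jobs table with id, url, dest_path and status columns; the table, column and connection details are placeholders, not from the question.

<?php
// Hypothetical worker: claim one pending job at a time, download the image,
// then mark the job done or failed. Table/column names are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=realestate', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

while (true) {
    // Claim the next pending job inside a transaction
    $pdo->beginTransaction();
    $job = $pdo->query(
        "SELECT id, url, dest_path FROM photo_jobs
         WHERE status = 'pending' ORDER BY id LIMIT 1 FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);

    if (!$job) {                 // queue empty: this worker is done
        $pdo->commit();
        break;
    }

    $pdo->prepare("UPDATE photo_jobs SET status = 'working' WHERE id = ?")
        ->execute(array($job['id']));
    $pdo->commit();

    // Do the actual download, then record the outcome
    $ok = @copy($job['url'], $job['dest_path']);
    $pdo->prepare("UPDATE photo_jobs SET status = ? WHERE id = ?")
        ->execute(array($ok ? 'done' : 'failed', $job['id']));

    usleep(200000);              // simple politeness delay between requests
}
?>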

If any script hangs for some reason, you can easily run it again to get only the images that haven't been downloaded yet. Btw, supervisord can be configured to automatically restart each script if it fails.

Another advantage is that at any time you can check the output of those scripts with supervisorctl. To check how many images are still waiting, you can simply query the "job queue" table.
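
For reference, a hedged example of what such a supervisord program entry might look like; the script path, program name and process count are placeholders:

; Hypothetical supervisord entry running 8 copies of the worker script
; and restarting any copy that exits or crashes.
[program:photo-worker]
command=php /var/www/scripts/worker.php
process_name=%(program_name)s_%(process_num)02d
numprocs=8
autostart=true
autorestart=true
stdout_logfile=/var/log/photo-worker.out.log
stderr_logfile=/var/log/photo-worker.err.log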

I am surprised the vendor is not allowing you to hot-link. The truth is you will not serve every image every month, so why download every image? Allowing you to hot link is a better use of everyone's bandwidth.

I manage a catalog with millions of items where the data is local but the images are mostly hot linked. Sometimes we need to hide the source of the image or the vendor requires us to cache the image. To accomplish both goals we use a proxy. We wrote our own proxy but you might find something open source that would meet your needs.

The way the proxy works is that we encrypt the image URL and then URL-encode the encrypted string. So http://yourvendor.com/img1.jpg becomes xtX957z. In our markup the img src tag is something like http://ourproxy.com/getImage.ashx?image=xtX957z.

When our proxy receives an image request, it decrypts the image URL. The proxy first looks on disk for the image. We derive the image name from the URL, so it is looking for something like yourvendorcom.img1.jpg. If the proxy cannot find the image on disk, then it uses the decrypted URL to fetch the image from the vendor. It then writes the image to disk and serves it back to the client. This approach has the advantage of being on demand with no wasted bandwidth. I only get the images I need and I only get them once.
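
The answer's own proxy is .NET-flavoured (getImage.ashx); as a rough PHP sketch of the same on-demand caching idea, assuming a hypothetical decrypt_url() helper and cache directory that are not part of the answer:

<?php
// getImage.php?image=xtX957z -- hypothetical on-demand caching proxy.
// decrypt_url() and $cacheDir are assumptions, not from the answer above.
$cacheDir = '/var/cache/images';
$token    = $_GET['image'];
$imageUrl = decrypt_url($token);                 // e.g. http://yourvendor.com/img1.jpg

// Derive a flat cache filename, e.g. yourvendorcom.img1.jpg
$parts     = parse_url($imageUrl);
$cacheName = str_replace('.', '', $parts['host'])
           . str_replace('/', '.', $parts['path']);
$cachePath = $cacheDir . '/' . $cacheName;

// Fetch and cache the image only the first time it is requested
if (!file_exists($cachePath)) {
    $data = file_get_contents($imageUrl);
    if ($data !== false) {
        file_put_contents($cachePath, $data);
    }
}

// Serve the cached copy back to the client
header('Content-Type: image/jpeg');
readfile($cachePath);
?>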
