
Periodically scrape and download all images from a website with javascript auto-scroll

I have found a website that has a lot of high quality free images hosted on Tumblr (it says do whatever you want with these images :P).

I am running on Ubuntu 12.04 LTS. I need to write a script which will run periodically (say, daily) and download only the images that were not downloaded earlier.

Additional note: the site has a javascript auto-scroller, and more images are loaded when you reach the bottom of the page.

First, you have to find out how the autoscrolling script works. The easiest way to do this is not to reverse-engineer the javascript, but to look at the network activity, for example in the "Net" panel of the Firebug Firefox plugin. You quickly see that the website is organized in pages:

unsplash.com/page/1
unsplash.com/page/2
unsplash.com/page/3
...

When you scroll, the script requests and downloads the succeeding pages.
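As a quick sanity check (a minimal sketch, not part of the original answer), you can request a couple of these pages by hand and confirm that each one is plain HTML carrying its own batch of images:

wget -q unsplash.com/page/1 -O page1.html
wget -q unsplash.com/page/2 -O page2.html
grep -c '<img src=' page1.html   # each page should report a handful of image tags
grep -c '<img src=' page2.html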

So we can actually write a script that downloads all the pages, parses their html for all the images, and downloads the images. If you look at the html code, you see that each image appears there in a nice and unique form:

<a href="http://bit.ly/14nUvzx"><img src="http://31.media.tumblr.com/2ba914db5ce556ee7371e354b438133d/tumblr_mq7bnogm3e1st5lhmo1_1280.jpg" alt="Download &nbsp;/ &nbsp;By Tony&nbsp;Naccarato" title="http://unsplash.com/post/55904517579/download-by-tony-naccarato" class="photo_img" /></a>

The <a href attribute contains the URL of the full-resolution image. The title attribute contains a nice unique URL that also leads to the image. We will use it to construct a nice unique name for the image, much nicer than the name under which it is stored. This unique name also ensures that no image is downloaded twice.
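To illustrate the name construction on the sample line above, here is a minimal sketch using the same sed expressions as the script below:

TITLE="http://unsplash.com/post/55904517579/download-by-tony-naccarato"
# keep everything after "post/" and replace the remaining slashes with underscores
echo "$TITLE" | sed 's|.*post/\(.*\)$|\1|' | sed 's|/|_|g'
# prints: 55904517579_download-by-tony-naccarato
# the script below appends ".jpg" and stores the file under imgs/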

Shell script (unsplash.sh)

#!/bin/bash
mkdir -p imgs   # -p: do not fail if the directory already exists (periodic runs)
I=1
while true ; do # for all the pages
        wget unsplash.com/page/$I -O tmppage
        grep '<a href.*<img src.*title' tmppage > tmppage.imgs
        if [ ! -s tmppage.imgs ] ; then # empty page - end the loop
                break
        fi
        echo "Reading page $I:"
        sed 's/^.*<a href="\([^"]*\)".*title="\([^"]*\)".*$/\1 \2/' tmppage.imgs | while read IMG POST ; do
                # for all the images on the page
                TARGET=imgs/$(echo "$POST" | sed 's|.*post/\(.*\)$|\1|' | sed 's|/|_|g').jpg
                echo -n "Photo $TARGET: "
                if [ -f "$TARGET" ] ; then # we already have this image
                        echo "already have"
                        continue
                fi
                echo "downloading"
                wget "$IMG" -O "$TARGET"
        done
        I=$((I+1))
done

To make sure this runs every day:

Create a wrapper script unsplash.cron:

#!/bin/bash

export PATH=... # might not be needed, but sometimes the PATH is not set 
                # correctly in cron-called scripts. Copy the PATH setting you 
                # normally see under console.

cd YOUR_DIRECTORY # the directory where the script and imgs directory is located

{
echo "========================"
echo -n "run unsplash.sh from cron "
date

./unsplash.sh 

} >> OUT.log 2>> ERR.log

Then add this line to your crontab (after issuing crontab -e on the console):

10 3 * * * PATH_to_the/unsplash.cron

This will run the script every day at 3:10.
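One practical detail worth adding (an assumption about the setup, not stated in the original answer): both scripts must be executable, and you can verify afterwards that the cron entry was installed:

chmod +x unsplash.sh unsplash.cron   # make both scripts executable
crontab -l                           # the "10 3 * * * ..." line should show up here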

Here's a small python version of the download part. The getImageURLs function scans the data from http://unsplash.com/page/X for lines which contain the word 'Download' and looks for the image 'src' attribute there. It also looks for the strings current_page and total_pages (which are present in the javascript code) to find out how long to keep going.

Currently, it retrieves all the URLs from all the pages first, and for these URLs the image is downloaded only if the corresponding file does not exist locally yet. Depending on how the page numbering changes over time, it may be somewhat more efficient to stop looking for image URLs as soon as a local copy of a file has been found. The files get stored in the directory in which the script was executed.

The other answer explains very well how to make sure something like this gets executed daily.

#!/usr/bin/env python
# note: Python 2 script (urllib.urlopen and print statements)

import urllib
import os

def getImageURLs(pageIndex):
    f = urllib.urlopen('http://unsplash.com/page/' + str(pageIndex))
    data = f.read()
    f.close()

    curPage = None
    numPages = None
    imgUrls = [ ]

    for l in data.splitlines():
        # lines containing 'Download' and an src attribute hold an image URL
        if 'Download' in l and 'src=' in l:
            idx = l.find('src="')
            if idx >= 0:
                idx2 = l.find('"', idx+5)
                if idx2 >= 0:
                    imgUrls.append(l[idx+5:idx2])

        # the page's javascript exposes the pagination state as
        # 'current_page = N;' and 'total_pages = M;'
        elif 'current_page = ' in l:
            idx = l.find('=')
            idx2 = l.find(';', idx)
            curPage = int(l[idx+1:idx2].strip())
        elif 'total_pages = ' in l:
            idx = l.find('=')
            idx2 = l.find(';', idx)
            numPages = int(l[idx+1:idx2].strip())

    return (curPage, numPages, imgUrls)

def retrieveAndSaveFile(fileName, url):
    f = urllib.urlopen(url)
    data = f.read()
    f.close()

    g = open(fileName, "wb")
    g.write(data)
    g.close()

if __name__ == "__main__":

    allImages = [ ]
    done = False
    page = 1
    while not done:
        print "Retrieving URLs on page", page
        res = getImageURLs(page)
        allImages += res[2]

        if res[0] >= res[1]:
            done = True
        else:
            page += 1

    for url in allImages:
        idx = url.rfind('/')
        fileName = url[idx+1:]
        if not os.path.exists(fileName):
            print "File", fileName, "not found locally, downloading from", url
            retrieveAndSaveFile(fileName, url)

    print "Done."
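A minimal usage sketch for this version (the file name unsplash_download.py and the target directory are assumptions, not part of the original answer):

mkdir -p ~/unsplash_imgs && cd ~/unsplash_imgs   # images land in the current working directory
python unsplash_download.py                      # "python" must be a Python 2 interpreter here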

The fantastic original script by TMS no longer works with the new unsplash website. Here is an updated working version.

#!/bin/bash
mkdir -p imgs
I=1
while true ; do # for all the pages
        wget "https://unsplash.com/grid?page=$I" -O tmppage

        grep img.*src.*unsplash.imgix.net tmppage | cut -d'?' -f1 | cut -d'"' -f2 > tmppage.imgs

        if [ ! -s tmppage.imgs ] ; then # empty page - end the loop
                break
        fi

        echo "Reading page $I:"
        cat tmppage.imgs | while read IMG; do

                # for all the images on page
                TARGET=imgs/$(basename "$IMG")

                echo -n "Photo $TARGET: "
                if [ -f "$TARGET" ] ; then # we already have this image
                        echo "file already exists"
                        continue
                fi
                echo -n "downloading (PAGE $I)"

                wget "$IMG" -O "$TARGET"
        done
        I=$((I+1))
done
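Since the site's markup changes over time, the grep pattern is the part most likely to break again. A quick sanity check (a sketch reusing the same pattern as the script above):

wget -q "https://unsplash.com/grid?page=1" -O tmppage
grep -c 'img.*src.*unsplash.imgix.net' tmppage   # a count of 0 means the pattern or the URL needs updating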
