Batch downloading text and images from a URL with Python / urllib / BeautifulSoup?

Question

I have been browsing several posts here, but I still can't work out how to batch-download images and text from a given URL with Python.

import urllib,urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
import os, sys

def getAllImages(url):
    query = urllib2.Request(url)
    user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
    query.add_header("User-Agent", user_agent)

    page = BeautifulSoup(urllib2.urlopen(query))
    for div in page.findAll("div", {"class": "thumbnail"}):
        print "found thumbnail"
        for img in div.findAll("img"):
            print "found image"
            src = img["src"]
            if src:
                src = absolutize(src, pageurl)
                f = open(src,'wb')
                f.write(urllib.urlopen(src).read())
                f.close()
        for h5 in div.findAll("h5"):
            print "found Headline"
            value = (h5.contents[0])
            print >> headlines.txt, value


def main():
    getAllImages("http://www.nytimes.com/")

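Above is some updated code. What happens is nothing: the code does not find any div with the thumbnail class, so nothing gets printed. Am I perhaps missing some pointers for getting the right divs that contain the images and headlines?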

Thanks a lot!

Answer 1

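The OS you are using doesn't know how to write to the file path you are passing it in src. Make sure that the name you use to save the file to disk is one the OS can actually use: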

import os

src = "abc.com/alpha/beta/charlie.jpg"
with open(src, "wb") as f:
    # IOError - cannot open file abc.com/alpha/beta/charlie.jpg
    pass

src = "alpha/beta/charlie.jpg"
os.makedirs(os.path.dirname(src))  # create alpha/beta before opening the file
with open(src, "wb") as f:
    # Golden - write file here
    pass

Everything should then start working.
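
One way to get such a name, as a minimal sketch: keep only the path component of the image URL and recreate it under a local folder. This is written for Python 3 (the question's code is Python 2), and `local_path_for`, `save_bytes` and the `downloads` folder are hypothetical names used for illustration, not part of the original answer.

```python
import os
from urllib.parse import urlsplit, unquote


def local_path_for(src, root="downloads"):
    """Map an image URL to a file path under root (hypothetical helper)."""
    # Keep only the path part of the URL; scheme, host and query are dropped.
    url_path = unquote(urlsplit(src).path).replace("\\", "/")
    # Drop empty, "." and ".." segments so the result stays below root.
    parts = [p for p in url_path.split("/") if p not in ("", ".", "..")]
    if not parts:
        parts = ["index"]  # assumption: fallback name for URLs without a path
    return os.path.join(root, *parts)


def save_bytes(data, path):
    """Create any missing parent directories, then write data to path."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
```

In the question's loop, the `open(src,'wb')` call could then become something like `save_bytes(image_bytes, local_path_for(src))`, where `image_bytes` is whatever was read from the image URL.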

A couple of additional thoughts:

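Make sure to normalize the path you save files to (e.g. os.path.join(some_root_dir, *relative_file_path*)); otherwise you will be writing images all over your hard drive, wherever their src happens to point.
Unless you are running a test of some kind, it is good to advertise in your user_agent string that you are a bot, and to honor robots.txt files (or, alternatively, to provide some kind of contact information so people can ask you to stop if they need to).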
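
Both suggestions can be sketched with the Python 3 standard library. This is an illustration under assumptions, not the answer's own code: `safe_join`, `is_allowed`, the bot name and the contact address are all made up for the example. `urllib.robotparser.RobotFileParser` is the standard-library robots.txt parser.

```python
import os
from urllib.robotparser import RobotFileParser

# Assumption: an example bot name and contact address, not from the original post.
USER_AGENT = "example-image-bot/0.1 (+mailto:owner@example.com)"


def safe_join(root, relative_path):
    """Join relative_path onto root and refuse results outside root."""
    root = os.path.abspath(root)
    target = os.path.abspath(os.path.join(root, relative_path))
    if os.path.commonpath([root, target]) != root:
        raise ValueError("path escapes download root: %r" % relative_path)
    return target


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Check url against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a real run the robots.txt text would be fetched once per site before any images are requested. `RobotFileParser` can also load it itself through `set_url` and `read`.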

Solution 1: 1, accepted, 2011-10-27 16:54:32
