Downloading multiple files from links on a page with Python

Question

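I am trying to download all the PGNs from this site.

I think I have to use urlopen to open each URL and then use urlretrieve to download each PGN by accessing it from the download button near the bottom of each game. Do I have to create a new BeautifulSoup object for each game? I'm also not sure how urlretrieve works.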

import urllib
from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup

url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    urlopen('http://chessgames.com'+link.get('href'))

Answer 1

There is no short answer to your question. I will show you a complete solution and comment on the code.

First, import the necessary modules:

from bs4 import BeautifulSoup
import requests
import re

Next, fetch the index page and create a BeautifulSoup object:

req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")

I strongly recommend using the lxml parser rather than the common html.parser. After that, you should prepare a list of links to the games:

# Raw string so that \? is passed to the regex engine unchanged
pages = soup.find_all('a', href=re.compile(r'.*chessgame\?.*'))

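You can do this by searching for links whose href contains the word "chessgame". Now you should prepare a function that will download the files for you: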

def download_file(url):
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)
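
To make the path line in download_file concrete, here is a small sketch of how the local file name falls out of a URL. The URLs and the two helper names are hypothetical, chosen only for illustration; the second helper is an alternative that parses the URL first, not something the function above does.

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(url):
    """Same expression as in download_file: last '/'-separated piece, minus the query string."""
    return url.split('/')[-1].split('?')[0]

def filename_from_url_safe(url):
    """Alternative: parse the URL first, so a '/' inside the query string cannot interfere."""
    return posixpath.basename(urlsplit(url).path)

# Hypothetical URLs, for illustration only
print(filename_from_url('http://example.com/pgn/game_1.pgn?gid=1'))   # game_1.pgn
print(filename_from_url('http://example.com/get?file=a/b.pgn'))       # b.pgn
print(filename_from_url_safe('http://example.com/get?file=a/b.pgn'))  # get
```

For URLs without a slash in the query string, the two helpers should agree.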

The final bit of magic is to repeat all the previous steps for each game page, preparing the links for the file downloader:

host = 'http://www.chessgames.com'
for page in pages:
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
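    # Find the link whose text contains "download"
    file_link = soup.find('a', string=re.compile('.*download.*'))
    if file_link is None:
        continue  # no download link on this page, so skip it
    file_url = host + file_link.get('href')
    download_file(file_url)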

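(First search for the link whose text contains "download", then build the full URL by concatenating the host name and the path, and finally download the file.)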
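
Concatenating the host and the href works as long as every href is a root-relative path. As a hedged alternative, urllib.parse.urljoin from the standard library also handles hrefs that are already absolute or that are relative to the current page. The href values below are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

host = 'http://www.chessgames.com'
page_url = host + '/perl/chessgame?gid=1'  # hypothetical game page

# Root-relative href: same result as host + href
root_relative = urljoin(page_url, '/pgn/example.pgn?gid=1')
# Absolute href: returned unchanged instead of being glued onto the host
absolute = urljoin(page_url, 'http://other.example/file.pgn')
# Page-relative href: resolved against the directory of the current page
page_relative = urljoin(page_url, 'other?gid=1')

print(root_relative)  # http://www.chessgames.com/pgn/example.pgn?gid=1
print(absolute)       # http://other.example/file.pgn
print(page_relative)  # http://www.chessgames.com/perl/other?gid=1
```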

I hope you can use this code without corrections!

Answer 2

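The accepted answer is great, but the task is embarrassingly parallel; there is no need to retrieve these sub-pages and files one at a time. This answer shows how to speed things up.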

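The first step is to use requests.Session() when sending multiple requests to a single host. Quoting from Advanced Usage: Session Objects in the requests documentation: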

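The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).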
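
As a minimal sketch of the "persist certain parameters" part of that quote, the snippet below sets a header on a session and shows that it is merged into a request built through that session. Nothing is sent over the network here; the User-Agent value is a made-up placeholder.

```python
import requests

with requests.Session() as session:
    # Settings stored on the session apply to every request made through it.
    session.headers.update({"User-Agent": "pgn-downloader-sketch/0.1"})  # hypothetical value

    # prepare_request merges session-level settings into a request without sending it.
    request = requests.Request(
        "GET", "http://www.chessgames.com/perl/chesscollection?cid=1014492"
    )
    prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])  # pgn-downloader-sketch/0.1
```

The connection reuse mentioned in the quote only comes into play once requests are actually sent with session.get.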

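Next, asyncio, multiprocessing, or multithreading can be used to parallelize the workload. Each has tradeoffs for the task at hand, and which one you choose is probably best determined by benchmarking and profiling. This page offers good examples of all three.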

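For the purposes of this post, I'll show multithreading. The GIL shouldn't be much of a bottleneck, because the work is mostly IO-bound: it consists largely of babysitting in-flight requests while waiting for responses. While one thread is blocked on IO, it can yield to another thread that is parsing HTML or doing other CPU-bound work.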
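
To illustrate why threads help with IO-bound work despite the GIL, here is a minimal sketch in which time.sleep stands in for waiting on a network response. The names fake_request and timed are hypothetical, and the timings are not measurements of the real site.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    """Stand-in for a network call: sleeping releases the GIL, as waiting on a socket does."""
    time.sleep(0.1)
    return i

def timed(max_workers, n_tasks=8):
    """Run n_tasks fake requests on a pool and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fake_request, range(n_tasks)))
    return results, time.perf_counter() - start

serial_results, serial_seconds = timed(max_workers=1)
threaded_results, threaded_seconds = timed(max_workers=8)
print(f"1 worker: {serial_seconds:.2f}s, 8 workers: {threaded_seconds:.2f}s")
```

With one worker the waits add up; with eight workers they overlap, so the second run should finish in roughly the time of a single wait.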

Here's the code:

import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def download_pgn(task):
    session, host, page, destination_path = task
    response = session.get(host + page)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", text="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()

    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

if __name__ == "__main__":
    host = "http://www.chessgames.com"
    url_to_scrape = host + "/perl/chesscollection?cid=1014492"
    destination_path = "pgns"
    max_workers = 8

    # Create the output directory if it does not exist yet
    os.makedirs(destination_path, exist_ok=True)

    with requests.Session() as session:
        response = session.get(url_to_scrape)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*chessgame\?.*"))
        tasks = [
            (session, host, page.get("href"), destination_path) 
            for page in pages
        ]

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pool.map(download_pgn, tasks)

I used response.iter_content here, which is unnecessary for such small text files, but it generalizes the code so that it handles larger files in a memory-friendly way.
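
One detail of ThreadPoolExecutor.map worth knowing when adapting this script: it returns a lazy iterator, and an exception raised inside a worker (for example by raise_for_status()) surfaces only when that result is retrieved. If the results are never iterated, failed downloads pass silently. A minimal sketch, using a hypothetical flaky worker in place of a real download:

```python
from concurrent.futures import ThreadPoolExecutor

def flaky(n):
    """Hypothetical worker that fails on one input, much like a bad HTTP response would."""
    if n == 2:
        raise ValueError(f"task {n} failed")
    return n * 10

with ThreadPoolExecutor(max_workers=4) as pool:
    lazy_results = pool.map(flaky, range(4))  # nothing is raised here

collected = []
caught = None
try:
    for value in lazy_results:  # the error appears once the failed result is reached
        collected.append(value)
except ValueError as exc:
    caught = exc

print(collected, caught)  # [0, 10] task 2 failed
```

Whether to consume the results and stop at the first error, or to catch and log errors inside the worker, is a design choice that depends on how failures should be handled.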

Results of a rough benchmark (the first request is the bottleneck):

max_workers	session?	seconds
1	no	126
1	yes	111
8	no	24
8	yes	22
32	yes	16

2 answers

Answer 1
6 accepted 2017-09-18 09:13:02

Answer 2
4 2021-02-01 16:43:24
