Python3. How to save downloaded webpages to a specified dir?

I am trying to save all the <a> links within the python homepage into a folder named 'Downloaded pages'. However, after 2 iterations through the for loop I receive the following error:

www.python.org#content
<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>
www.python.org#python-network
<_io.BufferedWriter name='Downloaded Pages/www.python.org#python-network'>

Traceback (most recent call last):
  File "/Users/Lucas/Python/AP book exercise/Web Scraping/linkVerification.py", line 26, in <module>
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
IsADirectoryError: [Errno 21] Is a directory: 'Downloaded Pages/'

I am unsure why this happens, as it appears the pages are being saved; seeing '<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>' suggests to me it's the correct path.

This is my code:

import requests, os, bs4

# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)

# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status() # Check if the download was successful

soupObj = bs4.BeautifulSoup(res.text, 'html.parser') # Collects all text from the webpage

# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)

for i in range(numOfLinks):
    linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
    print(os.path.basename(linkUrlToOpen))

    # save each downloaded page to the 'Downloaded pages' folder
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
    print(downloadedPage)
    if linkElem == []:
        print('Error, link does not work')
    else:
        for chunk in res.iter_content(100000):
            downloadedPage.write(chunk)
        downloadedPage.close()

Appreciate any advice, thanks.

The problem is that when you parse the basename of a URL that points to an actual page, like one ending in .html, it works, but when the URL doesn't specify one, like "http://python.org/", the basename is actually empty (you can try printing first the url and then the basename between brackets or something to see what I mean). So to work around that, the easiest solution would be to use absolute paths like @Thyebri said.
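
You can see the empty basename with a couple of prints (a quick demo snippet I'm adding here, not part of the original script):

import os

print(os.path.basename('https://www.python.org#content'))  # 'www.python.org#content'
print(os.path.basename('https://www.python.org/about/'))   # '' (the URL ends in '/')

# An empty basename makes the join return just the directory,
# which is why open() raises IsADirectoryError:
print(os.path.join('Downloaded Pages', ''))  # 'Downloaded Pages/'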

Also, remember that the filename you write cannot contain characters like '/', '\' or '?'.

So, I don't know if the following code is messy or not, but using the re library I would do the following:

import re

filename = re.sub(r'[\\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
downloadedPage = open(os.path.join('Downloaded_Pages', filename), 'wb')

So, first I remove the "https://" part, and then with the regular expressions library I replace all the usual symbols present in url links with a dash '-', and that is the name that will be given to the file.
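
Putting it all together, the whole loop could look something like this (just a sketch: it keeps your original assumption that every href on the page is site-relative, and it fetches each linked page instead of re-writing the homepage response res for every link):

import requests, os, bs4, re

os.makedirs('Downloaded_Pages', exist_ok=True)

url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status()

soupObj = bs4.BeautifulSoup(res.text, 'html.parser')

for linkElem in soupObj.select('a'):
    href = linkElem.get('href')
    if not href:  # some <a> tags have no href at all
        print('Error, link does not work')
        continue
    linkUrlToOpen = 'https://www.python.org' + href
    # Strip the scheme, then replace filesystem-unsafe characters with '-'
    filename = re.sub(r'[\\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
    # Fetch the linked page itself (the original loop re-saved the homepage)
    linkRes = requests.get(linkUrlToOpen)
    linkRes.raise_for_status()
    with open(os.path.join('Downloaded_Pages', filename), 'wb') as downloadedPage:
        for chunk in linkRes.iter_content(100000):
            downloadedPage.write(chunk)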

Hope it works!
