How to download files with Python from a website that uses PHP

Question

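I have a Python script that crawls various websites and downloads files from them. My problem is that some of the sites seem to use PHP. At least that is my theory, because the URLs look like this: https://www.portablefreeware.com/download.ZE1BFD762321E409CEE4AC0B163

The problem is that I cannot get a file name or extension from a link like this, so I cannot save the file. At the moment I only save the URL.

Is there a way to get the actual file name behind the link?

Here is my stripped-down download code: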

r = requests.get(url, allow_redirects=True)

file = open("name.something", 'wb')
file.write(r.content)
file.close()

Disclaimer: I have never done any work with PHP, so please forgive any incorrect terminology or misunderstanding on my part. I am happy to learn more.

Answer 1

You are downloading with requests. It is not suited to this kind of download.

Use urllib instead:

import urllib.request

urllib.request.urlretrieve(url, filepath)

Answer 2

import requests
import mimetypes

response = requests.get('https://www.portablefreeware.com/download.php?dd=1159')
content_type = response.headers['Content-Type']
#extension = mimetypes.guess_extension(content_type)
print(content_type)
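
The commented-out mimetypes.guess_extension call can be made to work once parameters such as "; charset=..." are stripped from the header value. A minimal sketch follows; the helper name extension_from_content_type and the ".bin" fallback are my own assumptions, not part of the answer:

```python
import mimetypes


def extension_from_content_type(content_type, default=".bin"):
    """Guess a file extension from a Content-Type header value.

    The name and the default fallback are hypothetical, chosen for illustration.
    """
    # Drop parameters such as "; charset=utf-8" before the lookup
    mime_type = content_type.split(";", 1)[0].strip().lower()
    # guess_extension returns None for unknown types, so fall back to default
    return mimetypes.guess_extension(mime_type) or default


print(extension_from_content_type("application/pdf"))
```

Servers often report a generic type such as application/octet-stream for downloads, so the guessed extension may not match the real file type.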

Answer 3

import urllib.request
# The first argument is the URL to fetch; the second is the local file name
urllib.request.urlretrieve(url, "name.something")

Answer 4

You should use the allow_redirects=False option to retrieve the Location header, which contains the actual download URL:

import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=False)
print(r.headers['Location'])

This outputs:

https://www.diskinternals.com/download/Linux_Reader.exe

Demo: https://replit.com/@blhsing/TrivialLightheartedLists

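You can then use os.path.basename to get a file name from that URL. Because r is only the redirect response, request the Location URL itself and write its content in binary mode:

import os

location = r.headers['Location']
download = requests.get(location)
with open(os.path.basename(location), 'wb') as file:
    file.write(download.content)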
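
Applied to a raw URL, os.path.basename keeps any query string and percent-encoding in the file name. Here is a small sketch of a stricter variant; the helper name filename_from_url and the "download.bin" fallback are made up for illustration:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, ignoring query and fragment.

    Hypothetical helper; the default name is an arbitrary choice.
    """
    # urlsplit separates the path from "?query" and "#fragment"
    path = urlsplit(url).path
    # Decode percent-escapes, then keep only the final path component
    name = posixpath.basename(unquote(path))
    return name or default


print(filename_from_url("https://www.diskinternals.com/download/Linux_Reader.exe"))
```

If the URL ends in a slash there is no final segment, so the fallback name is used instead.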

Answer 5

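You can get the file name from the response headers and then download the file.

Here is my download code, with a progress bar and chunked writes:

The progress bar uses tqdm (pip install tqdm).
Writing in chunks keeps memory use low during the download.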

import os

import requests
import tqdm
url = "https://www.portablefreeware.com/download.php?dd=1159"
response_header = requests.head(url)
file_path = response_header.headers["Location"]
file_name = os.path.basename(file_path)
with open(file_name, "wb") as file:
    response = requests.get(url, stream=True)
    total_length = int(response.headers.get("content-length"))
    for chunk in tqdm.tqdm(response.iter_content(chunk_size=1024), total=total_length / 1024, unit="KB"):
        if chunk:
            file.write(chunk)
            file.flush()

Progress output:

6%|▌         | 2848/46100.1640625 [00:04<01:11, 606.90KB/s]
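
Some servers put the file name in a Content-Disposition header instead of the URL. Whether this particular site does so is not shown in the thread, so treat this as an assumption. This is a sketch using only the standard library, with a hypothetical helper name:

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None.

    Hypothetical helper; it reuses the standard library's MIME header parser.
    """
    if not header_value:
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() returns None when no filename parameter is present
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="Linux_Reader.exe"'))
```

With requests, the value to pass in would be response.headers.get("Content-Disposition"), falling back to the URL-based name when the result is None.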

Answer 6

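Redirects can bounce you anywhere across a DNS-distributed network. The example answers above show https://www, but in my case the requests resolve to Europe, so my fastest local source is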

https://eu.diskinternals.com/download/Linux_Reader.exe

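By far the simplest approach is to try curl first. If the download is good, there is no need to inspect or scrape anything.

There is no need to resolve anything yourself: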
curl -o 1159.tmp https://www.portablefreeware.com/download.php?dd=1159

But I know that in this case that is not the expected result, so the next level is

curl -I https://www.portablefreeware.com/download.php?dd=1159 |find "Location"

which gives the result that others have shown:
https://www.diskinternals.com/download/Linux_Reader.exe
But that is not the full picture, because if we feed that back in with

curl.exe -K location.txt

we get

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://eu.diskinternals.com/download/Linux_Reader.exe">here</a>.</p>
</body></html>

So there is a nested redirect to

https://eu.diskinternals.com/download/Linux_Reader.exe

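All of this can be done as a command-line script that loops in a line or two, but I do not use Python, so you may need to write a dozen or so lines to do something similar.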
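
For comparison, the hop-by-hop lookup done with curl above can be sketched in Python in roughly a dozen lines. The function names, the max_hops limit and the injectable get_location callable are assumptions for illustration; head_location relies on the requests package being installed:

```python
from urllib.parse import urljoin


def resolve_redirect_chain(url, get_location, max_hops=10):
    """Follow Location headers one hop at a time; return every URL visited.

    get_location(url) is a hypothetical callable that returns the Location
    header for url, or None when the response is not a redirect.
    """
    chain = [url]
    for _ in range(max_hops):
        location = get_location(chain[-1])
        if not location:
            break
        # Location may be relative, so resolve it against the current URL
        chain.append(urljoin(chain[-1], location))
    return chain


def head_location(url):
    """One possible get_location, using a HEAD request via requests."""
    import requests  # imported lazily so the chain logic runs without it

    response = requests.head(url, allow_redirects=False)
    return response.headers.get("Location")
```

Calling resolve_redirect_chain(url, head_location)[-1] would then give the final download URL. requests can also follow the whole chain itself with allow_redirects=True, after which response.url holds the final address.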


Solution 1
0 2020-06-23 11:12:54

Solution 2
0 2022-08-31 10:31:22

Solution 3
0 2022-09-01 23:49:57

Solution 4
0 2022-09-02 08:17:17

Solution 5
0 2022-09-04 06:17:46

Solution 6
0 2022-09-04 23:13:25
