
Trying to download data (text) from a range of URLs in Python

Sorry for the possibly dull question. I am trying to download text from a range of URLs with Python, all at once. They follow a very straightforward structure:

" http://example.com/01000/01000/01000.htm "; " http://example.com/01000/01000/01000.htm "; " http://example.com/01000/01001/01001.htm "; " http://example.com/01000/01001/01001.htm ";

and so on, up to 01099.

After getting the text, I would need to analyze it with the nltk toolkit. I have tried to use wget on Windows, but it did not work from the command line. I am wondering if there is a way, similar to the glob module but for URLs, to download data from this whole range at once.
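Since the URLs differ only in the counter, the whole list can be generated up front with an f-string (a sketch of the pattern described above; example.com stands in for the real host):

```python
# Build every URL in the range 01000-01099; the counter appears
# in both the directory name and the file name.
urls = [f"http://example.com/01000/0{i}/0{i}.htm" for i in range(1000, 1100)]

print(urls[0])    # http://example.com/01000/01000/01000.htm
print(urls[-1])   # http://example.com/01000/01099/01099.htm
print(len(urls))  # 100
```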

(There are also some blank URLs in the range.)

Thanks a lot for your help.

Once you've built the URL with string manipulation (since you know the structure of the URLs), you can use the Requests module.

Example:

import requests

for i in range(1000, 1100):
    # Both the directory and the file name change with the counter,
    # e.g. /01000/01000/01000.htm, /01000/01001/01001.htm, ...
    target_url = "http://example.com/01000/0{0}/0{0}.htm".format(i)
    r = requests.get(target_url)

    print(r.text)

You could try my python3-wget module. Here's an example of its use:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import wget

# The counter appears in both the directory and the file name
base = 'http://example.com/01000/0{0}/0{0}.htm'
for x in range(1000, 1100):  # 01000 through 01099 inclusive
    url = base.format(x)
    filename = wget.download(url)

That will download all the files. If you need to extract specific text from the pages, you will need to look into creating a simple web scraper with Requests and BeautifulSoup4.
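As a rough illustration of what such a scraper does, here is a sketch that pulls the plain text out of an HTML page using only the standard library's html.parser (BeautifulSoup4 does this more robustly; the TextExtractor class and the sample markup are made up for this example):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        if data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

parser = TextExtractor()
parser.feed("<html><body><h1>Title</h1><p>Some body text.</p></body></html>")
print(parser.text())  # Title Some body text.
```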

Thanks a lot for your help. In the end, this is what my code looks like:

import requests

base_url = "http://example.com/01000/0"
for i in range(1000, 1100):
    target_url = base_url + str(i) + '/' + '0' + str(i) + '.htm'
    r = requests.get(target_url)
    print(target_url)

    with open(str(i) + ".htm", 'w', encoding="iso-8859-1") as f:
        f.write(r.text)

The encoding is there because of language-specific text. It downloaded all the files in the given range, from http://example.com/01000/01000/01000.htm to /01000/01099/01099.htm.
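Since the question mentions blank URLs in the range, it may also be worth skipping empty responses before writing each file. A minimal sketch (the is_blank helper is hypothetical, not part of the original answers):

```python
def is_blank(page_text):
    """Return True when a downloaded page carries no usable text."""
    return not page_text.strip()

# Hypothetical use inside the download loop above:
#     r = requests.get(target_url)
#     if r.status_code != 200 or is_blank(r.text):
#         continue  # skip missing or blank pages

print(is_blank("   \n"))             # True
print(is_blank("<p>real text</p>"))  # False
```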
