
Trying to download data (text) from a range of URLs in Python

Sorry for the possibly dull question. I am trying to download text from a range of URLs with Python, all at once. They follow a very straightforward structure:

" http://example.com/01000/01000/01000.htm "; " http://example.com/01000/01000/01000.htm "; " http://example.com/01000/01001/01001.htm "; " http://example.com/01000/01001/01001.htm ";

and so on, up to 01099.

After getting the text, I would need to analyze it with the nltk toolkit. I have tried to use wget on Windows, but it did not work from the command line. I am wondering if there is a way, similar to the glob module but for URLs, to download data from this whole range at once.
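Since the URLs differ only in the counter, the whole list can be generated up front with an f-string (a sketch of the pattern described above; example.com stands in for the real host):

```python
# Build every URL in the range 01000-01099; the counter appears
# in both the directory name and the file name.
urls = [f"http://example.com/01000/0{i}/0{i}.htm" for i in range(1000, 1100)]

print(urls[0])    # http://example.com/01000/01000/01000.htm
print(urls[-1])   # http://example.com/01000/01099/01099.htm
print(len(urls))  # 100
```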

(There are also some blank URLs in the range.)

Thanks a lot for your help.

Once you've built the URL with string manipulation (since you know the structure of the URLs), you can use the Requests module.

Example:

import requests

for i in range(1000, 1100):
    # Both the directory and the file name change with the counter,
    # e.g. /01000/01000/01000.htm, /01000/01001/01001.htm, ...
    target_url = "http://example.com/01000/0{0}/0{0}.htm".format(i)
    r = requests.get(target_url)

    print(r.text)

You could try my python3-wget module. Here's an example of its use:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import wget

# The counter appears in both the directory and the file name
base = 'http://example.com/01000/0{0}/0{0}.htm'
for x in range(1000, 1100):  # 01000 through 01099 inclusive
    url = base.format(x)
    filename = wget.download(url)

That will download all the files. If you need to extract specific text from the pages, you will need to look into creating a simple web scraper with Requests and BeautifulSoup4.
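As a rough illustration of what such a scraper does, here is a sketch that pulls the plain text out of an HTML page using only the standard library's html.parser (BeautifulSoup4 does this more robustly; the TextExtractor class and the sample markup are made up for this example):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        if data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

parser = TextExtractor()
parser.feed("<html><body><h1>Title</h1><p>Some body text.</p></body></html>")
print(parser.text())  # Title Some body text.
```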

Thanks a lot for your help. In the end, this is what my code looks like:

import requests

base_url = "http://example.com/01000/0"
for i in range(1000, 1100):
    target_url = base_url + str(i) + '/' + '0' + str(i) + '.htm'
    r = requests.get(target_url)
    print(target_url)

    with open(str(i) + ".htm", 'w', encoding="iso-8859-1") as f:
        f.write(r.text)

The encoding is there because of language-specific text. It downloaded all the files in the given range, from http://example.com/01000/01000/01000.htm to /01000/01099/01099.htm.
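Since the question mentions blank URLs in the range, it may also be worth skipping empty responses before writing each file. A minimal sketch (the is_blank helper is hypothetical, not part of the original answers):

```python
def is_blank(page_text):
    """Return True when a downloaded page carries no usable text."""
    return not page_text.strip()

# Hypothetical use inside the download loop above:
#     r = requests.get(target_url)
#     if r.status_code != 200 or is_blank(r.text):
#         continue  # skip missing or blank pages

print(is_blank("   \n"))             # True
print(is_blank("<p>real text</p>"))  # False
```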
