Is there a better, simpler way to download multiple files?
I am downloading turnstile data from the New York City MTA website and came up with a Python script that downloads only the 2017 data.
Here is the script:
import urllib
import re
html = urllib.urlopen('http://web.mta.info/developers/turnstile.html').read()
links = re.findall('href="(data/\S*17[01]\S*[a-z])"', html)
for link in links:
    txting = urllib.urlopen('http://web.mta.info/developers/'+link).read()
    lin = link[20:40]
    fhand = open(lin,'w')
    fhand.write(txting)
    fhand.close()
Is there a simpler way to write this script?
As @dizzyf suggested, you can use BeautifulSoup to get the href values from the web page.
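from bs4 import BeautifulSoup
# html is the page source fetched in the question's script
soup = BeautifulSoup(html, 'html.parser')
# href=True skips anchors without an href, so the substring test never sees None
links = [link['href'] for link in soup.find_all('a', href=True)
         if 'turnstile_17' in link['href']]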
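If installing BeautifulSoup is not an option, the same filtering can be sketched with the standard library's html.parser module (Python 3). The class name HrefCollector and the sample markup are made up for illustration; only the 'turnstile_17' substring comes from the snippet above.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags that contain a given substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; a value may be None
        href = dict(attrs).get('href')
        if href and self.needle in href:
            self.links.append(href)


# Hypothetical sample markup standing in for the downloaded page
sample_html = '''
<a href="data/nyct/turnstile/turnstile_170107.txt">Saturday, January 07, 2017</a>
<a href="data/nyct/turnstile/turnstile_161231.txt">Saturday, December 31, 2016</a>
<a name="top">no href here</a>
'''

collector = HrefCollector('turnstile_17')
collector.feed(sample_html)
links = collector.links
print(links)
```

With the real page, the string returned by the download call would be passed to feed() in place of sample_html.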
If you don't have to fetch the files with Python (and you are on a system that has the wget command), you can write the links to a file:
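# The hrefs are relative, so prepend the base URL used in the question
base_url = 'http://web.mta.info/developers/'
with open('url_list.txt', 'w') as url_file:
    for url in links:
        url_file.write(base_url + url + '\n')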
Then download them with wget:
$ wget -i url_list.txt
wget -i downloads every URL listed in the file into the current directory, keeping the original file names.
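If wget is not available, roughly the same behavior can be sketched in Python 3 with the standard library alone. The helper names filename_from_url and download_all are hypothetical, not part of any library. Taking the last path segment as the file name also avoids the hard-coded link[20:40] slice from the question.

```python
import os
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


def download_all(urls, target_dir='.'):
    """Download each URL into target_dir, keeping the remote file name."""
    saved = []
    for url in urls:
        destination = os.path.join(target_dir, filename_from_url(url))
        urlretrieve(url, destination)  # performs the HTTP request
        saved.append(destination)
    return saved


# Possible usage, assuming url_list.txt was written as described above:
# with open('url_list.txt') as url_file:
#     download_all(line.strip() for line in url_file if line.strip())
```

The usage lines are left commented out so that importing or running the sketch does not hit the network.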
The code below should do what you need.
import random
import re
import time
import bs4
import requests
pattern = '2017'
url_base = 'http://web.mta.info/developers/'
url_home = url_base + 'turnstile.html'
response = requests.get(url_home)
data = dict()
soup = bs4.BeautifulSoup(response.text, 'html.parser')
# Keep only anchors that have an href and whose link text mentions the year
links = [link.get('href') for link in soup.find_all('a', href=True,
         string=re.compile(pattern))]
for link in links:
    url = url_base + link
    print("Pulling data from: %s" % url)
    response = requests.get(url)
    # I don't know what you want to do with the data, so here I just store
    # it in a dict; you could write it to a file as you did in your example.
    data[link] = response.text
    not_a_robot = random.randint(2, 15)
    print("Waiting %d seconds before next query." % not_a_robot)
    # Some APIs will throttle you if you hit them too quickly
    time.sleep(not_a_robot)
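The script above keeps every response in the data dict. A minimal Python 3 sketch of writing such a dict to disk instead, one file per link, might look like this. The function name save_responses, the temporary output directory and the sample dict contents are assumptions for illustration only.

```python
import os
import tempfile


def save_responses(data, out_dir):
    """Write each {link: text} entry to out_dir, named after the link's last segment."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for link, text in data.items():
        path = os.path.join(out_dir, os.path.basename(link))
        with open(path, 'w', encoding='utf-8') as handle:
            handle.write(text)
        written.append(path)
    return written


# Made-up stand-in for the dict built by the download loop
sample = {'data/nyct/turnstile/turnstile_170107.txt': 'placeholder contents\n'}
out_dir = tempfile.mkdtemp()
paths = save_responses(sample, out_dir)
print(paths)
```

Using the with statement closes each file even if a write fails, which the manual open/close pair in the question does not guarantee.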
Notice: The technical posts on this site are published under the CC BY-SA 4.0 license. If you repost this content, please credit this site's URL or the original address.