Python web-scraping and downloading specific zip files in Windows
I am trying to download and stream the contents of specific zip files linked from a web page.
The page lists labels and links to the zip files in a table structure, like this:
Filename Flag Link
testfile_20190725_csv.zip Y zip
testfile_20190725_xml.zip Y zip
testfile_20190724_csv.zip Y zip
testfile_20190724_xml.zip Y zip
testfile_20190723_csv.zip Y zip
testfile_20190723_xml.zip Y zip
(etc.)
The word "zip" above is the link to the zip file. I only want to download the CSV zip files, and only the most recent x (say 7) shown on the page, not the XML zip files.
A sample of the page's HTML looks like this:
<tr>
<td class="labelOptional_ind">
testfile_20190725_csv.zip
</td>
<td class="labelOptional" width="15%">
<div align="center">
Y
</div>
</td>
<td class="labelOptional" width="15%">
<div align="center">
<a href="/test1/servlets/mbDownload?doclookupId=671334586">
zip
</a>
</div>
</td>
</tr>
<tr>
<td class="labelOptional_ind">
testfile_20190725_xml.zip
</td>
<td class="labelOptional" width="15%">
<div align="center">
N
</div>
</td>
<td class="labelOptional" width="15%">
<div align="center">
<a href="/test1/servlets/mbDownload?doclookupId=671190392">
zip
</a>
</div>
</td>
</tr>
<tr>
<td class="labelOptional_ind">
testfile_20190724_csv.zip
</td>
<td class="labelOptional" width="15%">
<div align="center">
I think I'm almost there, but need a little help. What I have been able to do so far:
1. Check whether the local download folder exists, and create it if it does not.
2. Set up BeautifulSoup, read all the main labels from the page (the first column of the table), and read all the zip links, i.e. the "a href"s.
3. For testing, manually set one variable to one of the labels and another variable to the corresponding zip-file link, download the file, and stream the zip file's CSV contents.
What I need help with: collecting all the main labels and their corresponding links, then looping through each one, skipping any XML labels/links, and downloading/streaming only the CSV labels/links.
Here is my code:
# Read zip files from page, download file, extract and stream output
from io import BytesIO
from zipfile import ZipFile
import urllib.request
import os,sys,requests,csv
from bs4 import BeautifulSoup
# check for download directory existence; create if not there
if not os.path.isdir('f:\\temp\\downloaded'):
    os.makedirs('f:\\temp\\downloaded')
# Get labels and zip file download links
mainurl = "http://www.test.com/"
url = "http://www.test.com/thisapp/GetReports.do?Id=12331"
# get page and setup BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
# Get all file labels and filter so only use CSVs
mainlabel = soup.find_all("td", {"class": "labelOptional_ind"})
for td in mainlabel:
    if "_csv" in td.text:
        print(td.text)
# Get all <a href> urls
for link in soup.find_all('a'):
    print(mainurl + link.get('href'))
# QUESTION: HOW CAN I LOOP THROUGH ALL FILE LABELS AND FIND ONLY THE
# CSV LABELS AND THEIR CORRESPONDING ZIP DOWNLOAD LINK, SKIPPING ANY
# XML LABELS/LINKS, THEN LOOP AND EXECUTE THE CODE BELOW FOR EACH,
# REPLACING zipfilename WITH THE MAIN LABEL AND zipurl WITH THE ZIP
# DOWNLOAD LINK?
# Test downloading and streaming
zipfilename = 'testfile_20190725_xml.zip'
zipurl = 'http://www.test.com/thisdownload/servlets/thisDownload?doclookupId=674992379'
outputFilename = "f:\\temp\\downloaded\\" + zipfilename
# Unzip and stream CSV file
url = urllib.request.urlopen(zipurl)
zippedData = url.read()
# Save zip file to disk
print("Saving to", outputFilename)
output = open(outputFilename,'wb')
output.write(zippedData)
output.close()
# Unzip and stream CSV file
with ZipFile(BytesIO(zippedData)) as my_zip_file:
    for contained_file in my_zip_file.namelist():
        with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
            for line in my_zip_file.open(contained_file).readlines():
                print(line)
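As an aside, the save-and-stream step above can be wrapped in a small helper that works on the raw zip bytes, so the same code can be reused for every downloaded file. This is a sketch using only the standard library; the function name `extract_zip_members` is my own, and the in-memory zip below is a stand-in for the real downloaded payload:

```python
from io import BytesIO
from zipfile import ZipFile

def extract_zip_members(zipped_data):
    """Return {member_name: bytes} for every file inside the zip payload."""
    contents = {}
    with ZipFile(BytesIO(zipped_data)) as zf:
        for name in zf.namelist():
            contents[name] = zf.read(name)
    return contents

# Build a tiny in-memory zip to demonstrate (stand-in for the real download)
buf = BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('testfile_20190725.csv', 'col1,col2\n1,2\n')

files = extract_zip_members(buf.getvalue())
print(files['testfile_20190725.csv'].decode())
```

In the real loop you would pass `zippedData` (the bytes read from the download URL) straight into the helper instead of building a zip in memory.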
To get all the required links, you can use the find_all() method with a custom function. The function searches for <td> tags whose text ends with "csv.zip". Here, data is the HTML snippet from the question:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for td in soup.find_all(lambda tag: tag.name=='td' and tag.text.strip().endswith('csv.zip')):
    link = td.find_next('a')
    print(td.get_text(strip=True), link['href'] if link else '')
Prints:
testfile_20190725_csv.zip /test1/servlets/mbDownload?doclookupId=671334586
testfile_20190724_csv.zip
Instead of creating two separate lists for the labels and the URLs, you can grab the whole row, check whether the label is a csv, and then download the file using the URL from that same row.
# Using the class name to identify the correct labels
mainlabel = soup.find_all("td", {"class": "labelOptional_ind"})
# find the containing row <tr> for each label
fullrows = [label.find_parent('tr') for label in mainlabel]
Now you can test each label and download the file with:
for row in fullrows:
    if "_csv" in row.text:
        print(mainurl + row.find('a').get('href'))  # download this!
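To also satisfy the "only the first x (e.g. 7) files" part of the question, you can filter and slice the scraped rows before downloading anything. This sketch uses plain list operations; the rows list below is a hand-written stand-in for the (label, href) pairs scraped from the table (only the complete rows from the question's snippet are shown), and N is the example cutoff of 7:

```python
# (label, href) pairs as they would come from the table, newest first
rows = [
    ('testfile_20190725_csv.zip', '/test1/servlets/mbDownload?doclookupId=671334586'),
    ('testfile_20190725_xml.zip', '/test1/servlets/mbDownload?doclookupId=671190392'),
]

N = 7  # keep at most the first N CSV files shown on the page
csv_rows = [(label, href) for label, href in rows if '_csv' in label][:N]

for label, href in csv_rows:
    print(label, href)  # prepend mainurl and download each of these
```

Because the page lists the newest files first, slicing with [:N] after the CSV filter gives exactly the most recent N CSV zips.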