list index out of range - beautiful soup
NEW TO PYTHON*** Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". I was given this code by someone else who wrote it, but I had to change the URL and now I am getting the error. When I print(list_of_documents) it is blank.

Can someone help me with this? The url requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is used here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
#set download location
downloads_folder = r"C:\Scripts"  # a raw string can't end with a backslash
##### Creating outage dataframe
#Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')
##create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and the identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
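For instance, here is a minimal sketch of how tag names and attributes drive the selection; the HTML snippet is a made-up stand-in (only the labelOptional_ind class name is taken from the question), since the real page has to be inspected in the browser:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the page HTML, just to show how tag/attribute
# selection works; the real structure must be checked in dev tools.
html = """
<table>
  <tr>
    <td class="labelOptional_ind">report_2023.zip</td>
    <td><a href="/misdownload/servlets/mirDownload?doclookupId=123">zip</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all matches on tag name + attributes - both must actually
# exist in the served HTML, or the result is an empty list
cells = soup.find_all("td", attrs={"class": "labelOptional_ind"})
links = [a["href"] for a in soup.select("a")]

print(cells[0].text)  # report_2023.zip
print(links[0])       # /misdownload/servlets/mirDownload?doclookupId=123
```

If the served HTML doesn't contain those tags (because JavaScript builds them later), both lists come back empty, which is exactly why the question's code hits `list index out of range`.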
With this site, if you want the info from the reportTable, it gets generated after the page is loaded with javascript, so it wouldn't show up in the request response. You could either try something like Selenium, or you could try retrieving the data from the source itself.
If you inspect the site and look at the network tab , you'll find a request (which is what actually retrieves the data for the table) that looks like this , and when you inspect the table's html , you'll find above it the scripts to generate the data.如果您检查站点并查看网络选项卡,您会发现一个看起来像这样的请求(实际上是检索表的数据),当您检查表的 html时,您会在它上方找到生成数据的脚本。
In the suggested solution below, getReqUrl scrapes your link to get the url for requesting the reports (and also the template of the url for downloading the documents).
def getReqUrl(scrapeUrl):
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")

    # find the inline script that defines the report urls, then keep
    # only its lines that contain exactly one quoted value
    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]

    # pull out each quoted value by the variable name on its line
    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]

    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough - the results are the same even then, and even when that last bit is omitted entirely; so it's totally optional.)
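That trailing parameter looks like the common cache-busting convention: the client appends the current Unix timestamp so intermediaries don't serve a stale cached response, which would explain why any fresh value (or none) works here. A one-liner shows what gets appended:

```python
import time

# Cache-busting query parameter: the current Unix timestamp in seconds.
# The server typically ignores its value; it only makes each request url
# unique so caches don't return an old copy of the JSON.
cache_buster = f"&_={int(time.time())}"
print(cache_buster)  # e.g. &_=1700000000
```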
The url returned can be used to request the documents:
import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)

reqRes = requests.get(reqUrl).text  # the response body is JSON, not html
rsJson = json.loads(reqRes)
for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    # print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")
print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
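Since the original goal was to pull the zip files, here is a hedged sketch of a helper for actually saving them to the downloads folder from the question. It assumes each downloadLink responds with the raw zip bytes (which is how the question's original code treated its link); the helper name itself is hypothetical:

```python
import os
import requests

def download_zip(download_link, file_name, downloads_folder):
    """Stream one report zip to disk and return the saved path.

    `download_link` and `file_name` are assumed to come from the
    DocumentList loop above - this helper itself is hypothetical.
    """
    os.makedirs(downloads_folder, exist_ok=True)
    file_path = os.path.join(downloads_folder, file_name)
    resp = requests.get(download_link, stream=True, timeout=30)
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of saving junk
    with open(file_path, "wb") as f:
        # stream in chunks so large zips don't sit entirely in memory
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return file_path
```

Called inside the loop as something like `download_zip(downloadLink, d['ConstructedName'], downloads_folder)`, it would save each report under its constructed name.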