
list index out of range - beautiful soup

NEW TO PYTHON*** Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". I was given this code by someone else who wrote it, but I had to change the URL, and now I am getting the error. When I print(list_of_documents), it is blank.

Can someone help me with this? The URL requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is used here and how I can get the list to populate correctly.

import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re

#set download location

downloads_folder = r"C:\Scripts"  # note: a raw string cannot end with a backslash


##### Creating outage dataframe

#Get list of download links

res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')

ercot_soup = BeautifulSoup(res.text, "lxml")

list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')

##create the url for the download 

loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')

# Define file name and set download path

file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name

You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so that you know the tag names (like td and a here) and the identifying attributes (like name, id, class, etc.) of the elements you need to extract data from.
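
For instance, a minimal sketch of checking what a selector actually matched before indexing into it (reusing the question's URL and selector; the empty-list guard is an illustrative addition, not part of the original code):

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
soup = BeautifulSoup(res.text, 'html.parser')

# inspect the selector result before indexing, so an empty match gives
# a useful message instead of "list index out of range"
cells = soup.find_all('td', attrs={'class': 'labelOptional_ind'})
if not cells:
  print('no matching td.labelOptional_ind cells - wrong selector, or content rendered by JavaScript')
else:
  print(f'found {len(cells)} cells; first: {cells[0].get_text(strip=True)}')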

With this site, if you want the info from the reportTable, it gets generated after the page is loaded with JavaScript, so it won't show up in the request response. You could either try something like Selenium (sketched below), or you could try retrieving the data from the source itself.
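
If you go the Selenium route, a rough sketch could look like the following (this assumes Selenium 4+, which manages the browser driver itself; the reportTable id comes from inspecting the rendered page and may change):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')

# wait for the JavaScript-generated table to appear in the DOM
WebDriverWait(driver, 15).until(
  EC.presence_of_element_located((By.ID, 'reportTable'))
)
rendered_html = driver.page_source  # now includes the rendered table
driver.quit()

rendered_html can then be parsed with BeautifulSoup the same way as res.text in the original code.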

If you inspect the site and look at the network tab, you'll find the request that actually retrieves the data for the table, and when you inspect the table's HTML you'll find, above it, the scripts that generate the data.

In the suggested solution below, getReqUrl scrapes your link to get the URL for requesting the reports (and also the template of the URL for downloading the documents).

def getReqUrl(scrapeUrl):
  res = requests.get(scrapeUrl)
  ercot_soup = BeautifulSoup(res.text, "html.parser")

  # find the inline <script> that defines both reportListUrl and reportTypeID,
  # then keep only its lines that contain exactly one quoted value
  script = [l.split('"') for l in [
      s for s in ercot_soup.select('script')
      if 'reportListUrl' in s.text
      and 'reportTypeID' in s.text
  ][0].text.split('\n') if l.count('"') == 2]

  # pull the quoted values out of the matching lines
  rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
  rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
  rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]

  # report-list request url (with an optional cache-busting timestamp)
  # plus the template url for downloading individual documents
  return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl

(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough - the results are the same even then, and even when that last bit is omitted entirely; so it's totally optional.)

The returned URL can then be used to request the documents:

import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)

for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
  d = doc['Document'] 
  downloadLink = ddUrl+d['DocID']
  #print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
  print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")
print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))

The print results will look like a series of lines of the form Download '<document name>' at <download link>, one per document, followed by the total number of documents.
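
Since the original goal was to pull a zip file, one possible continuation (a sketch, not part of the original answer: it reuses downloads_folder and the zf alias for ZipFile from the question, and assumes the documents are zip archives) would be:

import os

docs = rsJson['ListDocsByRptTypeRes']['DocumentList']
if docs:
  d = docs[0]['Document']
  file_path = os.path.join(downloads_folder, d['ConstructedName'])
  # download the document via the template url + its DocID
  dl = requests.get(ddUrl + d['DocID'])
  dl.raise_for_status()
  with open(file_path, 'wb') as f:
    f.write(dl.content)
  # extract the archive into the downloads folder
  with zf(file_path) as z:
    z.extractall(downloads_folder)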
