
How to fix BeautifulSoup/selenium working on same website for some pages but not all?

I'm trying to scrape each page of: https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=0

Right now I have code that changes the URL iteratively. The URL is then passed into a Selenium driver to grab the HTML content, and the content is then fed into BeautifulSoup for processing. My problem is that I get the following message randomly (it happens on different pages each run, with no consistent page that fails, and it crashes the program):

Traceback (most recent call last):
  File "scrape.py", line 89, in <module>
   i, i + 5000)
  File "scrape.py", line 37, in scrapeWebsite
    extractedInfo = info.findAll("td")
AttributeError: 'NoneType' object has no attribute 'findAll'

The `i, i + 5000` arguments are just used to loop over page offsets iteratively, so they're not important here.
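The traceback itself points at the cause: `soupPage.find(...)` returned `None` because the parsed HTML contained no matching `<table>`, and calling `.findAll()` on `None` raises exactly this `AttributeError`. A minimal reproduction with a page that lacks the table (the HTML string here is illustrative):

```python
from bs4 import BeautifulSoup

# A page without the expected results table reproduces the failure mode.
html_without_table = "<html><body><p>Too many requests</p></body></html>"
soup = BeautifulSoup(html_without_table, "html.parser")

info = soup.find("table", attrs={"class": "datatable center"})
print(info)  # None: find() returns None when nothing matches

# Guarding against None avoids the crash:
cells = info.findAll("td") if info is not None else []
print(cells)  # []
```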

Here's the code that does the HTML grabbing:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/Users/Downloads/chromedriver')
print(start, stop)


madeDict = {"Date": [], "Team": [], "Name": [], "Relinquished": [], "Notes": []}

#for i in range(0, 214025, 25):
for i in range(start, stop, 25):
    print("Current Page: " + str(i))
    currUrl = url + str(i)
    driver.get(currUrl)
    driver.implicitly_wait(100)
    soupPage = BeautifulSoup(driver.page_source, 'html.parser')
    #page = urllib2.urlopen(currUrl)
    #soupPage = BeautifulSoup(page, 'html.parser')

    # #Sleep the program to ensure page is fully loaded
    # time.sleep(1)

    info = soupPage.find("table", attrs={'class': 'datatable center'})
    extractedInfo = info.findAll("td")

My guess is that the page doesn't finish loading, so when the code tries to grab the content, the tags may not be there yet. However, I thought Selenium handled dynamically loading webpages and ensured the page was fully loaded before BeautifulSoup grabbed the info. I was looking at other posts, and some said I needed to make the program wait for the page to load dynamically, but I tried that and still got the same error.
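One robust pattern (a sketch, not the poster's code) is to re-fetch the page until the results table is actually present, instead of relying on `implicitly_wait` alone. Here `load_page` stands in for any zero-argument callable that returns the page HTML, e.g. a wrapper around `driver.get` plus `driver.page_source`; the function and parameter names are illustrative:

```python
import time

def get_page_with_table(load_page, marker='class="datatable center"',
                        attempts=3, delay=1.0):
    """Call load_page() until the HTML contains the results-table marker.

    load_page: zero-argument callable returning the page HTML.
    Returns the HTML on success, or None after all attempts fail.
    """
    for _ in range(attempts):
        html = load_page()
        if marker in html:
            return html
        time.sleep(delay)  # give the page time to finish loading
    return None

# Demo with a stub loader that only succeeds on the second call.
responses = iter(["<html>loading</html>",
                  '<table class="datatable center"><td>ok</td></table>'])
html = get_page_with_table(lambda: next(responses), delay=0.0)
print(html is not None)  # True
```

With Selenium specifically, `WebDriverWait` combined with an `expected_conditions` check for the table element is the idiomatic equivalent of this loop.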

Executed without selenium, used requests instead:

import requests
from bs4 import BeautifulSoup

url='https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start='

for i in range(0, 214025, 25):
    print("Current Page: " + str(i))
    r=requests.get(url + str(i))
    soup = BeautifulSoup(r.content, 'html.parser')
    info = soup.find("table", attrs={'class': 'datatable center'})
    extractedInfo = info.findAll("td")
    print(extractedInfo)
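Once the `<td>` cells are extracted, they can be grouped back into rows matching the five columns in `madeDict` (Date, Team, Name, Relinquished, Notes). A small sketch, assuming the cells come out in row order; the sample values are hypothetical:

```python
def cells_to_rows(cell_texts,
                  columns=("Date", "Team", "Name", "Relinquished", "Notes")):
    """Group a flat list of cell strings into dicts, one per table row."""
    n = len(columns)
    return [dict(zip(columns, cell_texts[i:i + n]))
            for i in range(0, len(cell_texts), n)]

# Two rows of hypothetical sample data, five cells each.
cells = ["2019-01-01", "49ers", "John Doe", "", "placed on IR",
         "2019-01-02", "Bears", "", "Jane Roe", "activated"]
rows = cells_to_rows(cells)
print(rows[0]["Team"])  # 49ers
```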
