How can I scrape data that gets generated after "Load more" is clicked while the URL remains unchanged?
I am trying to scrape all the courses under one category, or the whole site, at https://www.classcentral.com/subject. However, the site only shows 55 courses (including ads) at first; you have to click the "Load more" button, which generates 50 more courses, and so on. I use Selenium to click the "Load more" button and then call the parse_subject function so that the newly loaded courses are yielded as data points. But the scraper just scrapes the first 55 courses over and over, endlessly. How do I make it scrape the next batch of 50 courses instead of the first batch again and again, and keep going until there are no courses left? Please help.
Here is the markup for the "Load the next 50 courses of [total]" button:
<button id="show-more-courses" class="btn-blue-outline width-14-16 medium-up-width-1-2 btn--large margin-top-medium text-center" data-page="2"
    style="" data-track-click="listing_click"
    data-track-props='{ "type": "Load More Courses", "page": "2" }'>
  <span class="small-up-hidden text--bold">Load more</span>
  <span class="hidden small-up-inline-block text--bold">
    Load the next 50 courses of 1127
  </span>
</button>
Here is my code:
import scrapy
from scrapy.http import Request
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class SubjectsSpider(scrapy.Spider):
    name = 'subjects'
    allowed_domains = ['class-central.com']
    start_urls = ['http://class-central.com/subjects']

    def __init__(self, subject=None):
        self.subject = subject

    def parse(self, response):
        if self.subject:
            print("True")
            subject_url = response.xpath('//*[contains(@title, "' + self.subject + '")]/@href').extract_first()
            yield Request(response.urljoin(subject_url), callback=self.parse_subject, dont_filter=True)
        else:
            self.logger.info('Scraping all subjects')
            subjects = response.xpath('//*[@class="unit-block unit-fill"]/a/@href').extract()
            for subject in subjects:
                self.logger.info(subject)
                yield Request(response.urljoin(subject), callback=self.parse_subject, dont_filter=True)

    def parse_subject(self, response):
        subject_name = response.xpath('//title/text()').extract_first()
        subject_name = subject_name.split(' | ')[0]
        courses = response.xpath('//*[@itemtype="http://schema.org/Event"]')
        for course in courses:
            course_name = course.xpath('.//*[@itemprop="name"]/text()').extract_first()
            course_url = course.xpath('.//*[@itemprop="url"]/@href').extract_first()
            absolute_course_url = response.urljoin(course_url)
            yield {
                'subject_name': subject_name,
                'course_name': course_name,
                'absolute_course_url': absolute_course_url,
            }
        # For loading more courses
        global driver  # declared global so that the browser window does not close after the request finishes
        driver = webdriver.Chrome('C:/webdrivers/chromedriver')
        driver.get(response.url)
        print(driver.current_url)
        try:
            button_element = driver.find_element_by_id('show-more-courses')
            # button_element.click()
            driver.execute_script("arguments[0].click();", button_element)
            yield Request(response.url, callback=self.parse_subject, dont_filter=True)
        except NoSuchElementException:
            pass
I don't believe you need Selenium at all here; a requests-based solution works, and the requests library is faster and more reliable. Here is some code that loops through all the pages; you can then use Beautiful Soup to parse the HTML. You will need to install Beautiful Soup before using it, if you have not already.
import requests
from bs4 import BeautifulSoup

# Change 10 to however many times you need to press "Load the next 50 courses"
for page in range(1, 10):
    params = {'page': str(page)}
    next_page = requests.get("https://www.classcentral.com/subject/cs", params=params)
    soup = BeautifulSoup(next_page.text, 'html.parser')
    # Parse the HTML here
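To flesh out the "parse the HTML" step, here is a minimal sketch of a parsing function that mirrors the itemprop selectors used in the question's Scrapy spider. The sample markup below is a simplified, hypothetical version of a course card (check the live page's actual markup before relying on these attributes); the same function can be applied to each page fetched in the loop above, and an empty result signals that you have run past the last page and can stop requesting further pages.

```python
from bs4 import BeautifulSoup

# Hypothetical sample of two course cards, modeled on the
# itemtype/itemprop attributes from the question's XPath selectors.
SAMPLE_HTML = """
<div itemtype="http://schema.org/Event">
  <a itemprop="url" href="/course/python-101">
    <span itemprop="name">Intro to Python</span>
  </a>
</div>
<div itemtype="http://schema.org/Event">
  <a itemprop="url" href="/course/ml-basics">
    <span itemprop="name">Machine Learning Basics</span>
  </a>
</div>
"""

def parse_courses(html):
    """Extract (course name, relative URL) pairs from one page of listings."""
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    for card in soup.select('[itemtype="http://schema.org/Event"]'):
        name = card.select_one('[itemprop="name"]')
        url = card.select_one('[itemprop="url"]')
        if name and url:
            courses.append((name.get_text(strip=True), url.get("href")))
    return courses

print(parse_courses(SAMPLE_HTML))
```

Inside the paging loop you would call `parse_courses(next_page.text)` and `break` as soon as it returns an empty list, which removes the need to hard-code the page count.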