[英]Can't fetch all the titles from a webpage
我試圖從這個網頁遞歸解析所有類別及其嵌套類別,最終導致這樣的頁面,最后是這個最里面的頁面,我想從中獲取所有產品標題。
該腳本可以按照上述步驟操作。 但是,當涉及從遍歷所有下一頁的結果頁面中獲取所有標題時,腳本獲得的內容少於數量。
這是我寫的:
class mySpider(scrapy.Spider):
name = "myspider"
start_urls = ['https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/subcategory_pages/Cables_P-10/e3a9792d-bafa-4e89-8e3f-8b1a45bd2682']
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
def parse(self,response):
cookie = response.headers.getlist('Set-Cookie')[1].decode().split(";")[0]
for item in response.xpath("//div[./h3[contains(.,'Category')]]/ul/li/a/@href").getall():
item_link = response.urljoin(item.strip())
if "/products/list_pages/" in item_link:
yield scrapy.Request(item_link,headers=self.headers,meta={'cookiejar': cookie},callback=self.parse_all_links)
else:
yield scrapy.Request(item_link,headers=self.headers,meta={'cookiejar': cookie},callback=self.parse)
def parse_all_links(self,response):
for item in response.css("[class='pxc-sales-data-wrp'][data-product-key] h3 > a[href][onclick]::attr(href)").getall():
target_link = response.urljoin(item.strip())
yield scrapy.Request(target_link,headers=self.headers,meta={'cookiejar': response.meta['cookiejar']},callback=self.parse_main_content)
next_page = response.css("a.pxc-pager-next::attr(href)").get()
if next_page:
base_url = response.css("base::attr(href)").get()
next_page_link = urljoin(base_url,next_page)
yield scrapy.Request(next_page_link,headers=self.headers,meta={'cookiejar': response.meta['cookiejar']},callback=self.parse_all_links)
def parse_main_content(self,response):
item = response.css("h1::text").get()
print(item)
如何獲得該類別中的所有可用標題?
The script gets different number of results every time I run it.
您的主要問題是您需要為每個"/products/list_pages/"
使用單獨的cookiejar
才能正確獲取下一頁。 我為此使用了類變量cookie
(請參閱我的代碼)並多次獲得相同的結果(4293 項)。
這是我的代碼(我不下載產品頁面(只是從產品列表中讀取產品標題):
class mySpider(scrapy.Spider):
name = "phoenixcontact"
start_urls = ['https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/subcategory_pages/Cables_P-10/e3a9792d-bafa-4e89-8e3f-8b1a45bd2682']
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
cookie = 1
def parse(self,response):
# cookie = response.headers.getlist('Set-Cookie')[1].decode().split(";")[0]
for item in response.xpath("//div[./h3[contains(.,'Category')]]/ul/li/a/@href").getall():
item_link = response.urljoin(item.strip())
if "/products/list_pages/" in item_link:
cookie = self.cookie
self.cookie += 1
yield scrapy.Request(item_link,headers=self.headers,meta={'cookiejar': cookie},callback=self.parse_all_links, cb_kwargs={'page_number': 1})
else:
yield scrapy.Request(item_link,headers=self.headers,callback=self.parse)
def parse_all_links(self,response, page_number):
# if page_number > 1:
# with open("Samples/Page.htm", "wb") as f:
# f.write(response.body)
# for item in response.css("[class='pxc-sales-data-wrp'][data-product-key] h3 > a[href][onclick]::attr(href)").getall():
for item in response.xpath('//div[@data-product-key]//h3//a'):
target_link = response.urljoin(item.xpath('./@href').get())
item_title = item.xpath('./text()').get()
yield {'title': item_title}
# yield scrapy.Request(target_link,headers=self.headers,meta={'cookiejar': response.meta['cookiejar']},callback=self.parse_main_content)
next_page = response.css("a.pxc-pager-next::attr(href)").get()
if next_page:
base_url = response.css("base::attr(href)").get()
next_page_link = response.urljoin(next_page)
yield scrapy.Request(next_page_link,headers=self.headers,meta={'cookiejar': response.meta['cookiejar']},callback=self.parse_all_links, cb_kwargs={'page_number': page_number + 1})
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.