Scrapy - How to scrape content generated with javascript?
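I am trying to scrape some classified ads from http://www.head-fi.org/f/6550/headphones-for-sale-trade
I have created a spider that scrapes the title, price, description and so on. It works well, but I cannot figure out how pagination works on this particular site. I believe it is generated with JavaScript, since the URL does not change.
Here is my code for scraping the first page: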
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from headfi_headphones.items import HeadfiHeadphonesItem

class MySpider(CrawlSpider):
    name = "headfiheadphones"
    allowed_domains = ["head-fi.org"]
    start_urls = ["http://www.head-fi.org/f/6550/headphones-for-sale-trade"]

    #rules = (
    #    Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=("//a[@class='tooltip']",)), callback="parse_items", follow= True),
    #)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath("//tr[@class='thread']")
        items = []
        for title in titles:
            item = HeadfiHeadphonesItem()
            item["title"] = title.select("td[@class='thread-col']/div[@class='shazam']/div[@class='thumbnail_body']/a[@class='classified-title']/text()").extract()
            item["link"] = title.select("td[@class='thread-col']/div[@class='shazam']/div[@class='thumbnail_body']/a[@class='classified-title']/@href").extract()
            item["img"] = title.select("td[@class='thread-col']/div[@class='shazam']/div[@class='thumbnail']/a[@class='thumb']/img/@src").extract()
            item["saletype"] = title.select("td/strong/text()").extract()
            item["price"] = title.select("td/div[@class='price']/span[@class='ctx-price']/text()").extract()
            item["currency"] = title.select("td/div[@class='price']/span[@class='currency']/text()").extract()
            items.append(item)
        return items
It returns something like this (I have included a single entry):
{"img": ["http://cdn.head-fi.org/9/92/80x80px-ZC-9228072e_image.jpeg"], "title": ["Hifiman HE1000 Mint"], "saletype": ["For Sale"], "price": ["$2,000"], "currency": ["(USD)"], "link": ["/t/819200/hifiman-he1000-mint"]},
Is there a way to scrape every page (1 to 80 or so) of the results that are being populated into the table by what I believe is JavaScript?
To handle content generated by JavaScript correctly, you should consider using Selenium. The package is available at https://pypi.python.org/pypi/selenium.
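As a rough illustration of the Selenium approach, the sketch below opens the listing in a real browser, yields the rendered HTML of each page, and clicks a "next" link to move on. It is only a sketch under assumptions: the link text "Next", the 2-second pause, the Firefox driver and the helper names `iter_rendered_pages` and `main` are not from the question, so inspect the actual pager and adjust them.

```python
import time

START_URL = "http://www.head-fi.org/f/6550/headphones-for-sale-trade"

# Selenium locator strategies are plain strings; this one has the same
# value as By.LINK_TEXT in selenium.webdriver.common.by.
BY_LINK_TEXT = "link text"


def iter_rendered_pages(driver, next_text="Next", max_pages=80, wait_seconds=2.0):
    """Yield the rendered HTML of each results page in turn.

    `next_text` is an assumed label for the pager link, not taken from the site.
    """
    driver.get(START_URL)
    for _ in range(max_pages):
        yield driver.page_source
        # find_elements returns an empty list when nothing matches,
        # which is treated here as "this was the last page".
        next_links = driver.find_elements(BY_LINK_TEXT, next_text)
        if not next_links:
            break
        next_links[0].click()
        # Crude fixed pause so the JavaScript can refill the table.
        time.sleep(wait_seconds)


def main():
    # Imported here so the helper above can be read and tested
    # without Selenium or Scrapy installed.
    from selenium import webdriver
    from scrapy.selector import Selector

    driver = webdriver.Firefox()
    try:
        for html in iter_rendered_pages(driver):
            rows = Selector(text=html).xpath("//tr[@class='thread']")
            print(len(rows), "listings on this page")
    finally:
        driver.quit()
```

Calling `main()` needs a local Firefox with its driver available. Wrapping each page in `Selector(text=html)` means the XPath expressions from the spider in the question can be reused on the rendered HTML. A fixed `time.sleep` is the simplest possible wait; an explicit wait on the table contents would be more robust.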