
Scrapy - How to scrape content generated with javascript?

I'm trying to scrape some classified ads on http://www.head-fi.org/f/6550/headphones-for-sale-trade

I created a spider which can scrape the titles, prices, descriptions, etc. It works well, but I can't figure out how pagination works on that specific website. I believe the pagination is generated with JavaScript, since the URL doesn't change.

This is my code for scraping the first page:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from headfi_headphones.items import HeadfiHeadphonesItem

class MySpider(CrawlSpider):
    name = "headfiheadphones"
    allowed_domains = ["head-fi.org"]
    start_urls = ["http://www.head-fi.org/f/6550/headphones-for-sale-trade"]

    #rules = (
    #    Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=("//a[@class='tooltip']",)), callback="parse_items", follow= True),
    #)

    def parse(self, response):
        # Note: parse() must be indented inside the class body; defined at
        # module level (as originally posted) it is never called by Scrapy.
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//tr[@class='thread']")
        items = []
        for title in titles:
            item = HeadfiHeadphonesItem()
            item["title"] = title.select("td[@class='thread-col']/div[@class='shazam']/div[@class='thumbnail_body']/a[@class='classified-title']/text()").extract()
            item["link"] = title.select("td[@class='thread-col']/div[@class='shazam']/div[@class='thumbnail_body']/a[@class='classified-title']/@href").extract()
            item["img"] = title.select("td[@class='thread-col']/div[@class='shazam']/div[@class='thumbnail']/a[@class='thumb']/img/@src").extract()
            item["saletype"] = title.select("td/strong/text()").extract()
            item["price"] = title.select("td/div[@class='price']/span[@class='ctx-price']/text()").extract()
            item["currency"] = title.select("td/div[@class='price']/span[@class='currency']/text()").extract()
            items.append(item)
        return items

It returns something like this (I've included one entry):

{"img": ["http://cdn.head-fi.org/9/92/80x80px-ZC-9228072e_image.jpeg"], "title": ["Hifiman HE1000 Mint"], "saletype": ["For Sale"], "price": ["$2,000"], "currency": ["(USD)"], "link": ["/t/819200/hifiman-he1000-mint"]},

Is there a way to scrape each page (about 1-80), given that the table is being populated by what I assume is JavaScript?

To properly parse JavaScript-generated content you should consider using Selenium. The package is available here: https://pypi.python.org/pypi/selenium
