
Get the links from website using scrapy

I am trying to extract the links from one class and store them using Scrapy, but I am not really sure what the problem is. Here is the code:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = [
    "http://losangeles.craigslist.org/search/jjj"
]

def parse(self, response):
    for sel in response.xpath('//a[@class="hdrlnk"]'):
        item = DmozItem()
        item['link'] = sel.xpath('//a/@href').extract()

        yield item

Command line:

scrapy crawl dmoz -o items.csv -t csv

Any help is very much appreciated, thanks in advance!

I have updated the code with a few things which were missing. Check it out:

import scrapy
from scrapy.contrib.spiders import CrawlSpider


class CompItem(scrapy.Item):
    link = scrapy.Field()


class criticspider(CrawlSpider):
    name = "craig"
    allowed_domains = ["losangeles.craigslist.org"]
    start_urls = ["http://losangeles.craigslist.org/search/jjj"]

    def parse_start_url(self, response):
        sites = response.xpath('//div[@class="content"]')
        items = []

        for site in sites:
            item = CompItem()
            item['link'] = site.xpath('.//p[@class="row"]/span/span[@class="pl"]/a/@href').extract()
            items.append(item)
        # return after the loop, not inside it, so every item is collected
        return items

If you are getting an error like the following

exceptions.NotImplementedError

it seems like your parse() function is not indented properly.
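To see why the indentation matters: Scrapy's base `Spider.parse` raises `NotImplementedError`, and a `parse()` dedented out of the class body (as in the question) is just a module-level function, so it never overrides the base method. A minimal stdlib-only illustration of the same pitfall, using a stand-in base class rather than `scrapy.Spider` itself:

```python
class Spider(object):  # stand-in for scrapy.Spider's base class
    def parse(self, response):
        raise NotImplementedError

class BrokenSpider(Spider):
    pass

def parse(self, response):
    # module-level function: NOT a method of BrokenSpider,
    # exactly like a parse() dedented out of the class body
    return "parsed"

class FixedSpider(Spider):
    def parse(self, response):  # indented inside the class: overrides the base
        return "parsed"

try:
    BrokenSpider().parse(None)
except NotImplementedError:
    print("BrokenSpider falls back to the base parse()")

print(FixedSpider().parse(None))  # -> parsed
```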

I have slightly modified your code:

# -*- coding: utf-8 -*-
import scrapy

# item class included here 
class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = [
    "http://losangeles.craigslist.org/search/jjj"
    ]

    BASE_URL = 'http://losangeles.craigslist.org'

    def parse(self, response):
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            item = DmozItem(link=absolute_url)
            yield item

This will give you only the first 100 results.

A sample output will be:

{'link': u'http://losangeles.craigslist.org/wst/web/5011899759.html'}
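The `BASE_URL + link` concatenation works here because Craigslist hrefs are root-relative (they start with `/`). A more general approach, sketched below with only the standard library, is `urljoin`, which also handles already-absolute and page-relative hrefs correctly (Scrapy 1.0+ exposes the same behaviour as `response.urljoin(link)`):

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

BASE_URL = 'http://losangeles.craigslist.org/search/jjj'

# a root-relative href, as Craigslist returns them
print(urljoin(BASE_URL, '/wst/web/5011899759.html'))
# -> http://losangeles.craigslist.org/wst/web/5011899759.html

# an already-absolute href is passed through unchanged
print(urljoin(BASE_URL, 'http://example.com/listing.html'))
# -> http://example.com/listing.html
```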
