
Python Scrapy does not crawl website

I am new to Python Scrapy and trying to work through a small example; however, I am running into problems. I can crawl only the first given URL, but I cannot crawl more than one page, let alone an entire website!

Please help me or give me some advice on how I can crawl an entire website, or more pages in general...

The example I am doing is very simple... My items.py

import scrapy

class WikiItem(scrapy.Item):
    title = scrapy.Field()

my wikip.py (the spider)

import scrapy
from wiki.items import WikiItem

class CrawlSpider(scrapy.Spider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org/wiki/"]
    start_urls = (
        'http://en.wikipedia.org/wiki/Portal:Arts',
    )

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = WikiItem()
            item['title'] = sel.xpath('//h1[@id="firstHeading"]/text()').extract()
            yield item

When I run scrapy crawl wikip -o data.csv in the root project directory, the result is:

title

Portal:Arts

Can anyone give me insight as to why it is not following URLs and crawling deeper?

I have checked some related SO questions, but they have not helped to solve the issue.

scrapy.Spider is the simplest spider. Change the class name CrawlSpider, since CrawlSpider is one of Scrapy's generic spider classes and shadowing it is confusing.

One of the two options below can be used:

1. class WikiSpider(scrapy.Spider)

2. class WikiSpider(CrawlSpider)

If you use the first option, you need to write the logic yourself for following the links you want to follow on each page.

For the second option, define a rules attribute after the start URLs, as below:

rules = (
    Rule(LinkExtractor(allow=(r'https://en.wikipedia.org/wiki/Portal:Arts\?.*',)), callback='parse_item', follow=True),
)

Also, please rename the function defined as "parse" if you use CrawlSpider. CrawlSpider uses the parse method internally to implement its crawling logic, so by overriding parse here you break the crawl, and that is why the spider doesn't work.
