
Python Scrapy does not crawl website

I am new to Python Scrapy and am trying to work through a small example, but I am having some problems! I am able to crawl only the first given URL; I cannot crawl more than one page, let alone the entire website!

Please help me or give me some advice on how I can crawl an entire website, or more pages in general...

The example I am doing is very simple... My items.py:

import scrapy

class WikiItem(scrapy.Item):
    title = scrapy.Field()

my wikip.py (the spider):

import scrapy
from wiki.items import WikiItem

class CrawlSpider(scrapy.Spider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org/wiki/"]
    start_urls = (
        'http://en.wikipedia.org/wiki/Portal:Arts',
    )

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = WikiItem()
            item['title'] = sel.xpath('//h1[@id="firstHeading"]/text()').extract()
            yield item

When I run scrapy crawl wikip -o data.csv in the root project directory, the result is:

title

Portal:Arts

Can anyone give me insight as to why it is not following URLs and crawling deeper?

I have checked some related SO questions, but they have not helped to solve the issue.

scrapy.Spider is the simplest spider. Change the class name CrawlSpider, since CrawlSpider is one of Scrapy's generic spiders.

One of the options below can be used:

1. class WikiSpider(scrapy.Spider)

2. class WikiSpider(CrawlSpider)

If you are using the first option, you need to write the logic for following the links you want to follow on that webpage yourself, for example as sketched below.
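A minimal sketch of that first option, keeping scrapy.Spider and following links by hand, might look like this. The XPath used to pick the links to follow and the change of allowed_domains to the bare domain are assumptions for illustration; adjust them to the links you actually want:

import scrapy
from wiki.items import WikiItem

class WikiSpider(scrapy.Spider):
    name = "wikip"
    # allowed_domains expects domain names only, not URL paths
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ['https://en.wikipedia.org/wiki/Portal:Arts']

    def parse(self, response):
        # Extract the page title, as in the original spider
        item = WikiItem()
        item['title'] = response.xpath('//h1[@id="firstHeading"]/text()').extract()
        yield item

        # Follow every /wiki/ link on the page and parse it the same way
        for href in response.xpath('//a[starts-with(@href, "/wiki/")]/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)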

For the second option you can do the following:

After start_urls you need to define a rule like this:

rules = ( Rule(LinkExtractor(allow=('https://en.wikipedia.org/wiki/Portal:Arts\?.*?')), callback='parse_item', follow=True,), )

Also, please change the name of the function defined as "parse" if you use CrawlSpider. CrawlSpider uses the parse method internally to implement its logic, so by overriding parse you break it and the crawl spider doesn't work. A full sketch is given below.
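Putting the second option together, a sketch of the CrawlSpider version might look like the following. The allow pattern is taken from the rule above; the callback name parse_item is arbitrary (it just must not be parse), and narrowing allowed_domains to the bare domain is an adjustment made here so the link extractor is not filtered out:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from wiki.items import WikiItem

class WikiSpider(CrawlSpider):
    name = "wikip"
    # allowed_domains expects domain names only, not URL paths
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ['https://en.wikipedia.org/wiki/Portal:Arts']

    rules = (
        # Follow links matching the pattern and send each response to parse_item
        Rule(LinkExtractor(allow=(r'https://en.wikipedia.org/wiki/Portal:Arts\?.*?',)),
             callback='parse_item', follow=True),
    )

    # Named parse_item, not parse: CrawlSpider uses parse internally
    def parse_item(self, response):
        item = WikiItem()
        item['title'] = response.xpath('//h1[@id="firstHeading"]/text()').extract()
        yield item

You can then run it the same way as before, with scrapy crawl wikip -o data.csv.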
