
Scrapy not able to go to next page

I am learning how to use Scrapy and am trying to make a crawler that scrapes a website's links and text. My crawler works for http://quotes.toscrape.com/ and http://books.toscrape.com/, but not for real-life examples like https://pypi.org/project/wikipedia/ or Wikipedia. I am not able to understand what is causing this. Please help me.

My code:

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from scrapy.utils.log import configure_logging

class firstSpider(scrapy.Spider):
    name = "htmlcrawler"
    start_urls = [
        'https://pypi.org/project/wikipedia/',
    ]

    def parse(self, response):
        val1 = response.css("p.text::text").extract_first()
        val2 = response.css("span.text::text").extract_first()
        val3 = response.css("pre.text::text").extract_first()
        # Concatenate results, skipping selectors that matched nothing (None)
        text = "".join(v for v in (val3, val2, val1) if v is not None)
        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        print(next_page)
        if next_page:
            next_page = response.urljoin(next_page)
            yield {'html': next_page, 'text': text}
            yield scrapy.Request(next_page, callback=self.parse)

def run():
    settings = get_project_settings()
    settings.set('FEED_FORMAT', 'json')
    settings.set('FEED_URI', 'result.json')
    settings.set('DEPTH_LIMIT', 60)  # setting names are case-sensitive; 'Depth_Limit' is ignored
    settings.set('DOWNLOAD_DELAY', 2)
    settings.set('DUPEFILTER_CLASS', 'scrapy.dupefilters.BaseDupeFilter')  # disables duplicate filtering

    configure_logging()
    runner = CrawlerRunner(settings)
    runner.crawl(firstSpider)

    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()
if __name__ == "__main__":
    run()
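One thing worth noting about `response.urljoin` in the spider above: a fragment-only href such as `#content` resolves to the same document, not to a new page. A quick stdlib sketch (the URL is illustrative):

```python
from urllib.parse import urljoin, urldefrag

page = "https://pypi.org/project/wikipedia/"

# A fragment-only href resolves within the same document.
next_page = urljoin(page, "#content")
print(next_page)  # https://pypi.org/project/wikipedia/#content

# Dropping the fragment shows it points at the same resource as `page`.
print(urldefrag(next_page).url == page)  # True
```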

I am running Scrapy from Atom's Hydrogen package.

Edit

I changed the dupe filter class and tried to make some changes to my link gatherer, following https://blog.siliconstraits.vn/building-web-crawler-scrapy/, but it still isn't working.

It's crawling, but it stops because you are sending requests for the same page: the extracted `#content` link resolves back to the URL you just scraped.

Scrapy has its duplicate filter (DupeFilter) enabled by default, so those repeat requests are dropped and the crawl ends.
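To see why the default filter ends the crawl here, its behaviour can be sketched roughly like this. This is a simplification, not Scrapy's actual code: the real `RFPDupeFilter` hashes the request method, canonicalized URL, and body, and ignores URL fragments by default.

```python
from urllib.parse import urldefrag

def should_crawl(url, seen):
    """Rough sketch of Scrapy's default duplicate filtering: a request
    whose fingerprint was already seen is silently dropped."""
    fingerprint, _ = urldefrag(url)  # fragments are ignored when fingerprinting
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True

seen = set()
print(should_crawl("https://pypi.org/project/wikipedia/", seen))          # True
# The 'next' link '#content' resolves to the same page, so its request
# is filtered out and the crawl stops after the first response.
print(should_crawl("https://pypi.org/project/wikipedia/#content", seen))  # False
```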
