
Scrapy: Struggling to Implement a Crawl Spider

I've been trying to implement a web crawler to scrape titles and points off the Hacker News website. I had success parsing it using the normal scrapy.Spider class. However, I'd like a more robust way of crawling through links using a link extractor. Here's my current setup:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(CrawlSpider):
    name = "crawl"
    allowed_domains = ['news.ycombinator.com']

    start_urls = [
        'https://news.ycombinator.com/news?p=2',
    ]


    rules = [
        Rule(LinkExtractor(allow=r'news?p=[3-9]'), callback='parse_news', follow=True)
    ]
    def parse_news(self, response):

        data = {}
        title = response.xpath("//td/a[@class='storylink']/text()").getall()
        point = response.xpath("//td[@class='subtext']/span/text()").getall()
        length = len(title)

        for each in range(length):
            data["title"] = title[each]
            data["point"] = point[each]
            yield data

I can't seem to get any information saved to a JSON file after running this, though.

Your code has a lot of errors, but as a first step, you have to fix the LinkExtractor:

Rule(LinkExtractor(allow=r'news\?p=[3-9]'), callback='parse_news', follow=True)

The question mark is a special character in regular expressions, so you have to put a \ before it. Next, you have to fix the data extraction process in your for loop.
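Both fixes can be sketched without running the spider. The regex assertions below show why the unescaped `?` never matches the literal query string, and `extract_items` is a hypothetical stand-in for the loop body: it pairs titles with points via `zip` and yields a fresh dict per item instead of mutating one shared `data` dict (an assumed fix, not the answerer's exact code):

```python
import re

# In r'news?p=', the "?" makes the "s" optional ("new" or "news" then "p="),
# so the pattern never matches the literal "?" in the URL "news?p=3".
assert re.search(r'news?p=[3-9]', 'news?p=3') is None
# Escaping the "?" matches the literal query string.
assert re.search(r'news\?p=[3-9]', 'news?p=3') is not None


def extract_items(titles, points):
    """Pair each title with its points and yield one fresh dict per item."""
    for title, point in zip(titles, points):
        # Build a new dict on every iteration; reusing a single dict means
        # every yielded item can end up sharing (and overwriting) state.
        yield {"title": title, "point": point}
```

Inside `parse_news`, the equivalent would be `yield from extract_items(title, point)` after the two `getall()` calls, with `zip` also guarding against the two lists having different lengths.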
