
Not making list of urls in scrapy spider

I created a scrapy spider that has to crawl a whole web page and extract the URLs. I also need to remove the social-media URLs, because I want to build a list of the remaining URLs, but somehow it isn't working. When I try to append each URL to the list, the same URL just keeps getting appended over and over.

import re
import scrapy
all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    start_urls = [
            'https://www.wireshark.org/docs/dfref/i/ip.html',
        ]
    def parse(self, response):
        page= response.url.split("/")[-2]
        filename='quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        for r in response.css('a'):
            url = r.css('::attr(href)').get()
            print('all the urls are here', url)
            for i in url:  # bug: this iterates over the characters of the href string
                all_urls.append(url)
                print(all_urls)
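
The inner `for i in url:` loop is the source of the repetition: `url` is a string, and iterating over a string yields its characters, so the same URL gets appended once per character. A minimal sketch of the behavior, outside of scrapy:

```python
# Iterating over a string yields its characters, so this inner loop
# appends the same URL once for every character in it.
all_urls = []
url = "https://example.com"  # 19 characters
for i in url:
    all_urls.append(url)

print(len(all_urls))  # one append per character: 19 copies of the same URL
```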

An easier way to get all of the urls on the page is to chain your css selectors and call getall().

For example:

import scrapy
all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    start_urls = [
            'https://www.wireshark.org/docs/dfref/i/ip.html',
        ]
    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            all_urls.append(url)
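
The question also mentions dropping social-media URLs before building the list. One way to do that (a sketch only; the blocklist of hosts is an assumption, not something given in the original code) is to parse each href with `urllib.parse.urlparse` and skip links whose host is on the blocklist:

```python
from urllib.parse import urlparse

# Hypothetical blocklist of social-media hosts; extend to suit your needs.
SOCIAL_HOSTS = {"facebook.com", "twitter.com", "instagram.com", "linkedin.com"}

def is_social(url: str) -> bool:
    """Return True if the URL's host is a known social-media domain."""
    host = urlparse(url).netloc.lower()
    # Strip a leading "www." so "www.twitter.com" matches "twitter.com".
    if host.startswith("www."):
        host = host[4:]
    return host in SOCIAL_HOSTS

urls = [
    "https://www.twitter.com/wireshark",
    "https://www.wireshark.org/docs/",
    "ip.version.html",  # relative link: empty netloc, so it is kept
]
filtered = [u for u in urls if not is_social(u)]
print(filtered)  # ['https://www.wireshark.org/docs/', 'ip.version.html']
```

Inside the spider, the same check can be applied in the `parse` loop: `if not is_social(url): all_urls.append(url)`.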

