
Not making list of urls in scrapy spider

I created a scrapy spider that has to crawl a whole web page and extract the URLs. I also need to remove the social-media URLs, because I want to build a list of the remaining URLs, but somehow it isn't working. When I try to append each URL to the list, the same URL just keeps getting appended over and over.

import re
import scrapy
all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    start_urls = [
            'https://www.wireshark.org/docs/dfref/i/ip.html',
        ]
    def parse(self, response):
        page= response.url.split("/")[-2]
        filename='quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        for r in response.css('a'):
            url = r.css('::attr(href)').get()
            print('all the urls are here', url)
            for i in url:  # bug: this iterates over the characters of the href string
                all_urls.append(url)
                print(all_urls)
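
The inner `for i in url:` loop is the source of the repetition: `url` is a string, and iterating over a string yields its characters, so the same URL gets appended once per character. A minimal sketch of the behavior, outside of scrapy:

```python
# Iterating over a string yields its characters, so this inner loop
# appends the same URL once for every character in it.
all_urls = []
url = "https://example.com"  # 19 characters
for i in url:
    all_urls.append(url)

print(len(all_urls))  # one append per character: 19 copies of the same URL
```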

An easier way to get all of the urls on the page is to chain your css selectors and call getall().

For example:

import scrapy
all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    start_urls = [
            'https://www.wireshark.org/docs/dfref/i/ip.html',
        ]
    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            all_urls.append(url)
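
The question also mentions dropping social-media URLs before building the list. One way to do that (a sketch only; the blocklist of hosts is an assumption, not something given in the original code) is to parse each href with `urllib.parse.urlparse` and skip links whose host is on the blocklist:

```python
from urllib.parse import urlparse

# Hypothetical blocklist of social-media hosts; extend to suit your needs.
SOCIAL_HOSTS = {"facebook.com", "twitter.com", "instagram.com", "linkedin.com"}

def is_social(url: str) -> bool:
    """Return True if the URL's host is a known social-media domain."""
    host = urlparse(url).netloc.lower()
    # Strip a leading "www." so "www.twitter.com" matches "twitter.com".
    if host.startswith("www."):
        host = host[4:]
    return host in SOCIAL_HOSTS

urls = [
    "https://www.twitter.com/wireshark",
    "https://www.wireshark.org/docs/",
    "ip.version.html",  # relative link: empty netloc, so it is kept
]
filtered = [u for u in urls if not is_social(u)]
print(filtered)  # ['https://www.wireshark.org/docs/', 'ip.version.html']
```

Inside the spider, the same check can be applied in the `parse` loop: `if not is_social(url): all_urls.append(url)`.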

