Not making a list of urls in scrapy spider
I created a scrapy spider that has to crawl a whole web page and extract the urls. Now I have to remove the social media urls from them, so I want to collect the urls in a list, but somehow it isn't working. When I try to append each url to the list, it just keeps listing the same url over and over.
import re
import scrapy

all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.wireshark.org/docs/dfref/i/ip.html',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        for r in response.css('a'):
            url = r.css('::attr(href)').get()
            print('all the urls are here', url)
            for i in url:
                all_urls.append(url)
        print(all_urls)
The problem is the inner loop: `for i in url` iterates over the characters of the url string, so the same url gets appended once per character, which is why the list keeps repeating it. A simpler way to get all the urls on a page is to chain your css selectors and call getall().
For example:
import scrapy

all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.wireshark.org/docs/dfref/i/ip.html',
    ]

    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            all_urls.append(url)
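Since your end goal is to drop the social media links before collecting the rest, one option is to check each href against a small blocklist of domains before appending it. Here is a minimal sketch of that idea; the SOCIAL_DOMAINS tuple is a hypothetical placeholder, so fill it with whichever sites you actually need to exclude:

import scrapy

# Hypothetical blocklist -- adjust to the social media sites you want to exclude.
SOCIAL_DOMAINS = ('facebook.com', 'twitter.com', 'instagram.com', 'linkedin.com')

all_urls = []

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.wireshark.org/docs/dfref/i/ip.html',
    ]

    def parse(self, response):
        for url in response.css('a::attr(href)').getall():
            # Resolve relative hrefs (e.g. "../f/foo.html") against the page url.
            absolute = response.urljoin(url)
            # Skip links that point at any blocklisted social media domain.
            if any(domain in absolute for domain in SOCIAL_DOMAINS):
                continue
            all_urls.append(absolute)

Matching by substring is crude; for stricter filtering you could parse each link with urllib.parse.urlparse and compare the netloc against the blocklist instead.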