How do I get Python Scrapy to extract the domains of all external links from a web page?

Question

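I want to loop over every link and, if it goes to an external domain, output it. At the moment it outputs all links (internal and external). What have I messed up? (For testing, I have adjusted the code to run on a single page only, rather than crawling the rest of the site.)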

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re

class MySpider(CrawlSpider):
    name = 'crawlspider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/BBC_News']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title']=response.xpath('//title').extract_first()
        for link in LinkExtractor(allow=(),deny=self.allowed_domains).extract_links(response):
            item['links']=response.xpath('//a/@href').extract()
        return item

Answer 1

The logic in your parse_item method doesn't look quite right:

def parse_item(self, response):
    item = dict()
    item['url'] = response.url
    item['title']=response.xpath('//title').extract_first()
    for link in LinkExtractor(allow=(),deny=self.allowed_domains).extract_links(response):
        item['links']=response.xpath('//a/@href').extract()
    return item

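You are iterating over every link from the extractor, but on each pass you set item["links"] to exactly the same thing (all the links on the response page). I expect you were trying to set item["links"] to all the links returned by the LinkExtractor? If so, you should change your method to: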

def parse_item(self, response):
    item = dict()
    item['url'] = response.url
    item['title'] = response.xpath('//title').extract_first()
    links = [link.url for link in LinkExtractor(deny=self.allowed_domains).extract_links(response)]        
    item['links'] = links
    return item
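
One caveat worth checking: LinkExtractor's deny argument takes regular expressions that are searched against the whole URL, whereas deny_domains compares domains. The stdlib-only sketch below does not call Scrapy; it merely imitates the two kinds of matching with hypothetical helpers (denied_by_regex, denied_by_domain) and a made-up archive URL, to show where they can disagree.

```python
import re
from urllib.parse import urlparse


def denied_by_regex(url, patterns):
    """Imitates regex-style filtering: a pattern may match anywhere in the URL."""
    return any(re.search(p, url) for p in patterns)


def denied_by_domain(url, domains):
    """Imitates domain-style filtering: only the host part of the URL is compared."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


internal = ["en.wikipedia.org"]
# Hypothetical external link whose path happens to contain the internal domain.
archived = "https://web.archive.org/web/2020/https://en.wikipedia.org/wiki/BBC_News"

print(denied_by_regex(archived, internal))   # True: the external link is dropped
print(denied_by_domain(archived, internal))  # False: the external link is kept
```

If that difference matters for your pages, passing deny_domains=self.allowed_domains to the LinkExtractor instead of deny is probably closer to what you want.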

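If you really only want the domains, you can use urlparse from urllib.parse to get the netloc. You may also want to use a set to remove duplicates. Your parse method would then become the following (the import is better placed at the top of the file):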

def parse_item(self, response):
    from urllib.parse import urlparse
    item = dict()
    item["url"] = response.url
    item["title"] = response.xpath("//title").extract_first()
    item["links"] = {
        urlparse(link.url).netloc
        for link in LinkExtractor(deny=self.allowed_domains).extract_links(response)
    }   
    return item
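
The domain step can also be tried on its own, without running a spider. The sketch below is only an illustration: unique_domains and sample_urls are made-up names and data, and it uses the hostname attribute instead of netloc, on the assumption that you want the port stripped and the host lower-cased before de-duplicating.

```python
from urllib.parse import urlparse


def unique_domains(urls):
    """Return the sorted, de-duplicated host names found in urls."""
    domains = set()
    for url in urls:
        # hostname is lower-cased and excludes any port or credentials
        host = urlparse(url).hostname
        if host:  # relative links and mailto: links have no host
            domains.add(host)
    return sorted(domains)


# Made-up sample data for illustration only.
sample_urls = [
    "https://www.bbc.co.uk/news",
    "https://www.bbc.co.uk/sport",
    "https://WWW.BBC.CO.UK:443/weather",
    "https://www.theguardian.com/uk",
    "mailto:someone@example.com",
    "/wiki/BBC_News",
]

print(unique_domains(sample_urls))  # ['www.bbc.co.uk', 'www.theguardian.com']
```

With netloc, the third URL would be counted separately as 'WWW.BBC.CO.UK:443', so which attribute to use depends on whether you treat those as the same domain.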
