Scraping special characters with Scrapy

Question

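I am scraping a page written in Danish and I am having a problem with the output. The output contains many special characters, such as (Ã¥, Ã, Ã¥, Ã¦), that differ from what is shown on the page.

How can I scrape the text exactly as it appears on the page?

Example link: https://novaindex.com/dk/leverandoerer/mode-og-tekstiler/arbejdstoej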

import scrapy
    
class MainSpider(scrapy.Spider):
    name = 'main'

    start_urls = ['https://novaindex.com/dk/leverandoerer/mode-og-tekstiler/arbejdstoej']

    def parse(self, response):

        details = response.xpath('//a[@class="companyresult "]')

        for each in details:
            name = each.xpath('normalize-space(.//span[@class="name"]/text())').get()
            street = each.xpath('normalize-space(.//span[@class="street"]/text())').get()
            city = each.xpath('normalize-space(.//span[@class="city"]/text())').get()
            phone = each.xpath('normalize-space(.//span[@class="phone"]/text())').get()

            yield {
                "Name": name,
                "Street Address": street,
                "City Address": city,
                "Phone": phone,
            }

Answer 1

You can call .encode('utf8') on the string returned by get(), or on each string returned by getall().

Scrapy extracts data as unicode strings. This may help you understand a bit about unicode and UTF-8:

What is a unicode string?
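To make the str/bytes distinction concrete, here is a minimal sketch using only the standard library. The sample word is an assumption standing in for a value returned by get(); it is not taken from the page. It shows what .encode('utf8') produces, and one common way sequences like Ã¥ and Ã¦ can arise: UTF-8 bytes being decoded with a different codec such as latin-1. Whether that is what happens in the asker's setup is not confirmed by the question.

```python
# Illustrative only: a plain string stands in for what get() returns.
danish = "blåbær"                    # a unicode (str) value

# .encode('utf8') turns the str into bytes; å and æ each become two bytes.
as_bytes = danish.encode("utf8")
print(as_bytes)                      # b'bl\xc3\xa5b\xc3\xa6r'

# Decoding those UTF-8 bytes with the wrong codec yields the Ã-sequences.
garbled = as_bytes.decode("latin-1")
print(garbled)                       # blÃ¥bÃ¦r

# Decoding with the codec that was used for encoding restores the text.
restored = as_bytes.decode("utf8")
print(restored)                      # blåbær
```

If the scraped strings look correct inside Python but wrong in the output file, it may be worth checking which encoding the program that opens the file assumes.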

Answer 2

The Danish codec is cp865. See all available codecs here.

Note: use ascii only when you are scraping English websites.

def string_cleaner(rouge_text):
    # Strip surrounding whitespace, then drop every character cp865 cannot represent.
    return rouge_text.strip().encode('cp865', 'ignore').decode('cp865')

Passing ignore makes encode() skip characters it cannot encode instead of raising an error.

Usage
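As a rough illustration of what the ignore handler does, the sketch below runs the helper on a few made-up strings (the sample texts are assumptions, not data from the page). Note that ignore removes unsupported characters silently, so text can be lost without any warning.

```python
def string_cleaner(rouge_text):
    # Same logic as the helper above: strip, then round-trip through cp865.
    return rouge_text.strip().encode("cp865", "ignore").decode("cp865")

# Danish letters exist in cp865, so they survive the round trip.
print(string_cleaner("  Rødgrød med fløde  "))   # Rødgrød med fløde

# The euro sign is not in cp865; 'ignore' silently drops it.
print(string_cleaner("Pris: 100€"))              # Pris: 100

# With 'ascii', the Danish letters themselves are dropped.
print("Rødgrød".encode("ascii", "ignore").decode("ascii"))   # Rdgrd

# Without 'ignore', the default 'strict' handler raises instead.
try:
    "100€".encode("cp865")
except UnicodeEncodeError as exc:
    print("strict mode failed:", exc.reason)
```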

yield {
    "Name": string_cleaner(name),
    ...
}

For more explanation of the code, check my code breakdown here.

2 answers

Answer 1
0 2020-07-29 13:58:33

Answer 2
0 accepted 2020-07-30 09:16:03
