Trying to get the encoding from a web page with Python and BeautifulSoup

Question

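I am trying to retrieve the charset from a web page (it changes from page to page). At the moment I use BeautifulSoup to parse the page and then extract the charset from the header. This worked fine until I ran into a site that had: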

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The code I have so far, which works with other pages, is:

    def get_encoding(soup):
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                encod = soup.meta.get('content')
        return encod

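Does anyone know how to extend this code so that it retrieves the charset from the example above? Would tokenizing the value and pulling the charset out that way be a reasonable idea? How would you do it without changing the whole function? Right now the code above returns "text/html; charset=utf-8", which causes a LookupError because that is not a known encoding.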
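To make the failure concrete: Python resolves encoding names through codecs.lookup, and the full content value is not a registered codec name, while the bare charset is. A minimal sketch using only the standard library (the variable names are illustrative, not from the code above):

```python
import codecs

# What the function currently returns for the problematic page.
content_value = "text/html; charset=utf-8"

try:
    codecs.lookup(content_value)
    lookup_failed = False
except LookupError:
    # The whole Content-Type string is not a codec name.
    lookup_failed = True

# The bare charset resolves to a real codec.
codec_name = codecs.lookup("utf-8").name
```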

Thanks

The final code I ended up using:

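    import re
    import chardet

    def get_encoding(soup):
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                content = soup.meta.get('content')
                # content may be missing, so fall back to an empty string
                match = re.search('charset=(.*)', content or '')
                if match:
                    encod = match.group(1)
                else:
                    # Python 2 code: unicode() does not exist in Python 3,
                    # where chardet.detect() expects bytes, not text.
                    dic_of_possible_encodings = chardet.detect(unicode(soup))
                    encod = dic_of_possible_encodings['encoding']
        return encod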

Answer 1

    import re

    def get_encoding(soup):
        if soup and soup.meta:
            encod = soup.meta.get('charset')
            if encod is None:
                encod = soup.meta.get('content-type')
                if encod is None:
                    content = soup.meta.get('content')
                    # Guard against a meta tag without a content attribute
                    match = re.search('charset=(.*)', content) if content else None
                    if match:
                        encod = match.group(1)
                    else:
                        raise ValueError('unable to find encoding')
        else:
            raise ValueError('unable to find encoding')
        return encod
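
The pattern 'charset=(.*)' captures everything after the equals sign, so trailing parameters or quotes would end up in the result. One possible alternative, not part of the answer above, is to treat the content value as a Content-Type header and let the standard library's email.message.Message tokenize it. A minimal sketch; charset_from_content is a hypothetical helper name:

```python
from email.message import Message


def charset_from_content(content):
    """Return the charset parameter of a Content-Type style string, or None."""
    if not content:
        return None
    msg = Message()
    msg["Content-Type"] = content
    # get_content_charset() returns the charset parameter lowercased,
    # or None when the value carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content("text/html"))                 # None
```

The result could replace the re.search branch inside get_encoding, leaving the rest of the function unchanged.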

Answer 2

In my case, soup.meta only returns the first meta tag found in the soup. Here is @Fruit's answer, extended to find the charset in any meta tag in the given HTML.

    from bs4 import BeautifulSoup
    import re

    def get_encoding(soup):
        """Return the charset declared in any meta tag, lowercased, or None."""
        encoding = None
        if soup:
            for meta_tag in soup.find_all("meta"):
                encoding = meta_tag.get('charset')
                if encoding:
                    break
                encoding = meta_tag.get('content-type')
                if encoding:
                    break
                content = meta_tag.get('content')
                if content:
                    match = re.search('charset=(.*)', content)
                    if match:
                        encoding = match.group(1)
                        break
        if encoding:
            # cast to str if type(encoding) == bs4.element.ContentMetaAttributeValue
            return str(encoding).lower()
        return None

    # Sample input: the meta tag from the question
    html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    soup = BeautifulSoup(html, 'html.parser')
    print(get_encoding(soup))
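
If the raw HTML string is at hand and BeautifulSoup is not needed for anything else, a similar scan over every meta tag can be approximated with the standard library's html.parser. This is a swapped-in technique rather than the method used in the answer, and the regular expression is stricter than 'charset=(.*)' by assumption. MetaCharsetParser and get_encoding_from_html are hypothetical names:

```python
import re
from html.parser import HTMLParser

# Stops at whitespace, semicolons and quotes instead of capturing the rest of the value.
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([^\s;"\']+)', re.IGNORECASE)


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared by any <meta> tag."""

    def __init__(self):
        super().__init__()
        self.encoding = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.encoding is not None:
            return
        attrs = dict(attrs)  # attribute names arrive lowercased
        if attrs.get("charset"):
            # <meta charset="...">
            self.encoding = attrs["charset"].lower()
        elif attrs.get("content"):
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            match = CHARSET_RE.search(attrs["content"])
            if match:
                self.encoding = match.group(1).lower()


def get_encoding_from_html(html):
    """Return the declared charset in lowercase, or None if no meta tag has one."""
    parser = MetaCharsetParser()
    parser.feed(html)
    return parser.encoding
```

Like the answer's version, this only reads what the page declares about itself; it does not verify that the declared charset matches the actual bytes.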

2 answers

Answer 1
4 · accepted · 2013-08-21 13:48:04

Answer 2
0 · 2021-04-20 12:22:51
