Python: Parsing Unicode characters with bs4

Question

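I am building a Python 3 web crawler/scraper with bs4. The program crashes whenever it encounters Unicode characters (such as Chinese symbols). How can I modify my scraper so that it supports Unicode?

Here is the code: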

import urllib.request
from bs4 import BeautifulSoup

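def crawlForData(url):
    # Download the page and parse the raw bytes with the built-in HTML parser
    r = urllib.request.urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    # Collect the text of every <p>, flattening newlines and trimming whitespace
    result = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
    for p in result:
        print(p)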

url = 'https://en.wikipedia.org/wiki/Adivasi'
crawlForData(url)

Answer 1

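You can try the unicode() function, which decodes a byte string into a Unicode string. Note that it exists only in Python 2; in Python 3, str is already Unicode, so there is nothing to call.

Alternatively, the way to go is

content.decode('utf-8','ignore')

where content is the raw bytes you read from the response (in Python 3 only bytes has a decode method, str does not).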
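
To see what the 'ignore' argument actually does, here is a small sketch that uses only the standard library. The helper name decode_lenient and the sample bytes are made up for illustration; the only real API it relies on is bytes.decode.

```python
def decode_lenient(raw, encoding="utf-8"):
    """Decode bytes to str, silently dropping bytes that are invalid in `encoding`."""
    return raw.decode(encoding, "ignore")

# Hypothetical sample: valid UTF-8 for two Chinese characters (U+4E2D, U+6587)
# followed by one stray byte that is never valid in UTF-8.
sample = "\u4e2d\u6587".encode("utf-8") + b"\xff"

cleaned = decode_lenient(sample)
print(len(cleaned))  # the stray byte is gone, only the two characters remain

try:
    sample.decode("utf-8")  # strict mode (the default) raises instead of skipping
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)
```

Keep in mind that 'ignore' throws data away without telling you, so it is a workaround for dirty input rather than a general fix.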

A complete solution might be:

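import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("your url")  # Python 3 replacement for urllib2.urlopen
content = html.read().decode('utf-8', 'ignore')  # drop bytes that are not valid UTF-8
soup = BeautifulSoup(content, 'html.parser')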
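
One more thing worth checking: the question does not show the traceback, so it is only an assumption, but in Python 3 this kind of crash is often a UnicodeEncodeError raised by print() when the terminal's encoding cannot represent a character, not a parsing problem. If that is the case here, a minimal sketch of a workaround could look like this (console_safe is a hypothetical helper, not part of bs4):

```python
import sys

def console_safe(text, encoding=None):
    """Return `text` with characters the target encoding cannot represent replaced by '?'."""
    # Assumption: fall back to the stdout encoding, or to ASCII if it is unknown.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

# Forcing ASCII here just to demonstrate the replacement on any terminal.
print(console_safe("caf\u00e9", "ascii"))
```

In the question's loop that would be print(console_safe(p)). On Python 3.7 and later, calling sys.stdout.reconfigure(errors='replace') once at startup should have a similar effect on the stream itself.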
