
Remove non-BMP characters from BeautifulSoup object

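I am a beginner in Python. I am using BeautifulSoup to extract data from a website, but whenever the page source contains an emoji, my program stops at that point. What can I do to remove the emojis/non-BMP characters before parsing, so that I can scrape the page?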

import bs4 as bs
import urllib.request

# Names such as `url` and `text` avoid shadowing the built-in `str`.
url = 'http://www.storypick.com/harshad-mehta-scam-web-series/'  # my URL
source = urllib.request.urlopen(url)
soup = bs.BeautifulSoup(source, 'lxml')

# Start with the page title, then append the text of each element in the post body.
match = soup.find('div', class_='td-post-content')
text = soup.title.text + "\n"
name = soup.title.text
for paragraph in match.find_all(['p', 'h4', 'h3', 'h2', 'blockquote']):
    text += paragraph.text + "\n"
print(text)

Output:

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 161-161: Non-BMP character not supported in Tk
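
The message names Tk, which suggests the exception is raised when `print` writes to a Tk-based shell such as IDLE, not while BeautifulSoup parses the page. As a minimal sketch for locating the offending characters, the hypothetical helper `find_non_bmp` below (not part of the original post) lists every character whose code point lies above U+FFFF together with its index. The sample string is made up for illustration.

```python
def find_non_bmp(text):
    """Return (index, character) pairs for characters outside the BMP."""
    # The Basic Multilingual Plane ends at U+FFFF; most emoji lie above it.
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


# Hypothetical sample text containing a single emoji (U+1F600).
sample = "Scam 1992 \U0001F600 web series"

for index, ch in find_non_bmp(sample):
    # Print the code point in hex rather than the character itself,
    # so the report is safe to show in a shell that cannot render it.
    print(index, hex(ord(ch)))
```

The reported indices play the same role as the "position 161-161" in the error message: they show where in the string the unsupported characters sit.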

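I switched to requests, which makes things simpler. The example below is simpler than what you are trying to do, but it does work. You should have no trouble completing your script from here.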

import requests
from bs4 import BeautifulSoup

requestURL = 'http://www.storypick.com/harshad-mehta-scam-web-series'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}

with requests.Session() as session:
    r = session.get(requestURL, headers=headers)
    if r.ok:
        soup = BeautifulSoup(r.content, 'lxml')
        for paragraph in soup.find_all('p'):
            print(paragraph)

Works perfectly for me! I modified the code a little:

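import bs4 as bs
import string
import urllib

str = 'http://www.storypick.com/harshad-mehta-scam-web-series/'  # my URL
# Python 2 only: in Python 3 this function is urllib.request.urlopen.
source = urllib.urlopen(str)
# No parser is named here, so bs4 picks the best one that is installed.
soup = bs.BeautifulSoup(source)

match = soup.find('div', class_='td-post-content')
# Note: `str` is rebound from the URL to the output text (it also shadows the built-in).
str = soup.title.text + "\n"
name = soup.title.text
for paragraph in match.find_all(['p', 'h4', 'h3', 'h2', 'blockquote']):
    str += paragraph.text + "\n"
print(str)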
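
Neither snippet above filters the text itself. One common way to avoid the UCS-2 error in a Tk-based shell is to drop or replace every character above U+FFFF before printing. The following is a minimal sketch of that idea using a regular expression. The helper name `strip_non_bmp`, its `replacement` parameter, and the sample string are assumptions made for illustration and do not come from the original posts.

```python
import re

# Matches any single character outside the Basic Multilingual Plane,
# i.e. any code point from U+10000 up to U+10FFFF.
_NON_BMP_RE = re.compile("[\U00010000-\U0010FFFF]")


def strip_non_bmp(text, replacement=""):
    """Return `text` with every non-BMP character replaced by `replacement`.

    With the default empty replacement the characters are simply removed.
    """
    return _NON_BMP_RE.sub(replacement, text)


# Hypothetical sample text containing one emoji (U+1F402).
sample = "Big Bull \U0001F402 returns"

# Remove the emoji entirely.
print(strip_non_bmp(sample))

# Or keep a visible marker by substituting U+FFFD (the replacement character);
# ascii() is used here only so the result prints safely in any shell.
print(ascii(strip_non_bmp(sample, "\uFFFD")))
```

In a script like the one in the question, the helper would be applied to the accumulated text just before the final `print`. The parsed page stays untouched, so the emoji are still available if the text is later written to a UTF-8 file instead.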


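Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.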

 