Remove non BMP characters from BeautifulSoup object

Question

I am a beginner in Python. I am using BeautifulSoup to extract data from websites, but whenever a page's source contains emoticons, my program stops with the error shown below. What should I do while or before parsing so that emoticons and other non-BMP characters are removed and the page is scraped?

import bs4 as bs
import string
import urllib.request

url = 'http://www.storypick.com/harshad-mehta-scam-web-series/'  # my URL
source = urllib.request.urlopen(url)
soup = bs.BeautifulSoup(source, 'lxml')

match = soup.find('div', class_='td-post-content')
text = soup.title.text + "\n"
name = soup.title.text
for paragraph in match.find_all(['p', 'h4', 'h3', 'h2', 'blockquote']):
    text += paragraph.text + "\n"
print(text)

Output:

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 161-161: Non-BMP character not supported in Tk

Answer 1

I switched to using requests, which makes things simpler. This example is simpler than what you are trying to do, but it works, and you should have no problem finishing your script from here.

import requests
from bs4 import BeautifulSoup

requestURL = 'http://www.storypick.com/harshad-mehta-scam-web-series'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}

with requests.Session() as session:
    r = session.get(requestURL, headers=headers)
    if r.ok:  # only parse when the request succeeded
        soup = BeautifulSoup(r.content, 'lxml')
        for paragraph in soup.find_all('p'):
            print(paragraph)

Answer 2

This works perfectly for me! I modified the code a little:

import bs4 as bs
import string
import urllib

url = 'http://www.storypick.com/harshad-mehta-scam-web-series/'  # my URL
# urllib.urlopen exists only in Python 2; Python 3 uses urllib.request.urlopen
source = urllib.urlopen(url)
soup = bs.BeautifulSoup(source)

match = soup.find('div', class_='td-post-content')
text = soup.title.text + "\n"
name = soup.title.text
for paragraph in match.find_all(['p', 'h4', 'h3', 'h2', 'blockquote']):
    text += paragraph.text + "\n"
print(text)
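
Neither answer changes the scraped text itself, so the same error may still appear when the text is printed in a Tk-based console such as IDLE. The Basic Multilingual Plane covers code points U+0000 through U+FFFF, and most emoji lie above it. Below is a minimal sketch of filtering those characters out of a string before printing. The helper name `strip_non_bmp` and the sample text are hypothetical and are not taken from the question or the answers.

```python
def strip_non_bmp(text):
    """Return text without characters outside the Basic Multilingual Plane."""
    # Keep only code points up to U+FFFF; anything higher (most emoji) is dropped.
    return ''.join(ch for ch in text if ord(ch) <= 0xFFFF)


# Hypothetical sample: a sentence containing one emoji (U+1F525).
sample = 'Scam 1992 \U0001F525 is a hit'
cleaned = strip_non_bmp(sample)
print(cleaned)
```

In the question's script, the helper would be applied to the accumulated text just before the final `print` call. Dropping the characters loses information, so writing the text to a UTF-8 file instead may be preferable when the emoji need to be kept.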
