Python: garbled Unicode characters from requests/bs4

Question

I have a script that gets the lyrics of a song from MetroLyrics using requests and bs4.

The problem is that when I print the result, it shows something like this (part of the lyrics):

Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, KashÃ¨r, á¸¤alÄl, Yom Kippur, Quaresima, Ramadan

when it should look like this:

Rabbi, Papa, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, vino, kashèr, ḥalāl, Yom Kippur, Quaresima, Ramadan

The code I use:

import requests
from bs4 import BeautifulSoup
import os

try:
    from urllib.parse import quote_plus
except ImportError:
    from urllib import quote_plus

def get_lyrics(song_name):
    song_name += ' metrolyrics'
    name = quote_plus(song_name)
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
           '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

    url = 'http://www.google.com/search?q=' + name

    result = requests.get(url, headers=hdr).text
    link_start = result.find('http://www.metrolyrics.com')

    if(link_start == -1):
        return("Lyrics not found on Metrolyrics")

    link_end = result.find('html', link_start + 1)
    link = result[link_start:link_end + 4]


    lyrics_html = requests.get(link, headers={
                               'User-Agent': 'Mozilla/5.0 (Macintosh; Intel '
                               'Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, '
                               'like Gecko) Chrome/55.0.2883.95 Safari/537.36'
                               }
                               ).text

    soup = BeautifulSoup(lyrics_html, "lxml")
    raw_lyrics = (soup.findAll('p', attrs={'class': 'verse'}))
    paras = []
    try:
        final_lyrics = unicode.join(u'\n', map(unicode, raw_lyrics))
    except NameError:
        final_lyrics = str.join(u'\n', map(str, raw_lyrics))

    final_lyrics = (final_lyrics.replace('<p class="verse">', '\n'))
    final_lyrics = (final_lyrics.replace('<br/>', ' '))
    final_lyrics = final_lyrics.replace('</p>', ' ')
    return (final_lyrics)

I have tried .encode('utf-8') and .encode('unicode-escape') and then converting back again, but nothing worked.

I have another script that uses the Musixmatch API, and there the Unicode characters display correctly.

Answer 1

I made a small change to the get_lyrics function:

return final_lyrics.encode('latin1').decode('utf-8')

and got the desired output:

# python2
print get_lyrics('kashèr')
...
Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, Kashèr, Ḥalāl, Yom Kippur, Quaresima, Ramadan
...
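
Why this works, as far as I can tell: the garbled text in the question looks like UTF-8 bytes that were decoded as Latin-1 (ISO-8859-1). requests falls back to ISO-8859-1 for text responses whose Content-Type header carries no charset, which would explain it here, though I have not checked the headers of that particular page. Encoding back to Latin-1 recovers the original bytes, and decoding those as UTF-8 gives the intended characters. Here is a minimal Python 3 sketch of that round trip with no network access; fix_mojibake is a hypothetical helper name of mine, not part of requests or bs4:

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Hypothetical helper for illustration only.
    """
    try:
        # Latin-1 maps code points 0-255 one-to-one onto bytes, so this
        # recovers the original bytes, which are then decoded as UTF-8.
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not Latin-1 mojibake; return it unchanged.
        return text


original = 'kashèr, ḥalāl'
raw_bytes = original.encode('utf-8')   # the bytes a UTF-8 page would send
garbled = raw_bytes.decode('latin1')   # what a wrong Latin-1 guess produces

print(garbled)                # starts with 'kashÃ¨r'
print(fix_mojibake(garbled))  # kashèr, ḥalāl
```

In the question's code it may be cleaner to avoid the wrong decode in the first place, for example by setting the response's encoding attribute to 'utf-8' before reading .text, or by passing the response's .content (raw bytes) to BeautifulSoup and letting it detect the encoding. Both assume the page really is UTF-8.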
