I have a script that fetches the lyrics of a song from MetroLyrics using requests and bs4.
The problem is that when I print the result, it shows something like this (part of the lyrics):
Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, Kashèr, ḤalÄl, Yom Kippur, Quaresima, Ramadan
when it should look like this:
Rabbi, Papa, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, vino, kashèr, ḥalāl, Yom Kippur, Quaresima, Ramadan
The code I use:
import requests
from bs4 import BeautifulSoup
import os
try:
    from urllib.parse import quote_plus
except ImportError:
    from urllib import quote_plus

def get_lyrics(song_name):
    song_name += ' metrolyrics'
    name = quote_plus(song_name)
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11'
           '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    url = 'http://www.google.com/search?q=' + name
    result = requests.get(url, headers=hdr).text
    link_start = result.find('http://www.metrolyrics.com')
    if(link_start == -1):
        return("Lyrics not found on Metrolyrics")
    link_end = result.find('html', link_start + 1)
    link = result[link_start:link_end + 4]
    lyrics_html = requests.get(link, headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel'
        'Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, '
        'like Gecko) Chrome/55.0.2883.95 Safari/537.36'
        }
    ).text
    soup = BeautifulSoup(lyrics_html, "lxml")
    raw_lyrics = (soup.findAll('p', attrs={'class': 'verse'}))
    paras = []
    try:
        final_lyrics = unicode.join(u'\n', map(unicode, raw_lyrics))
    except NameError:
        final_lyrics = str.join(u'\n', map(str, raw_lyrics))
    final_lyrics = (final_lyrics.replace('<p class="verse">', '\n'))
    final_lyrics = (final_lyrics.replace('<br/>', ' '))
    final_lyrics = final_lyrics.replace('</p>', ' ')
    return (final_lyrics)
I have tried .encode('utf-8') and .encode('unicode-escape'), and then converting back again, but neither solved it.
I have another script that uses the Musixmatch API, and there the Unicode characters are displayed correctly.
I made a small change to the last line of the get_lyrics function:
    return final_lyrics.encode('latin1').decode('utf-8')
and got the desired output:
# python2
print get_lyrics('kashèr')
...
Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, Kashèr, Ḥalāl, Yom Kippur, Quaresima, Ramadan
...
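The output in the question looks like UTF-8 text that was decoded as Latin-1 (ISO-8859-1) on the way in, which is probably why re-encoding as Latin-1 and decoding as UTF-8 repairs it. Below is a minimal sketch of that round trip on a short sample string, with no network access. The helper name fix_mojibake is hypothetical and only used for illustration, and the sketch assumes the server really sent UTF-8 bytes.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce and undo "UTF-8 decoded as Latin-1" mojibake.
# Assumption: the page bytes were UTF-8 but were decoded as Latin-1.

def fix_mojibake(text):
    """Hypothetical helper: reverse a wrong Latin-1 decode of UTF-8 bytes."""
    # Latin-1 maps every code point 0-255 to a single byte, so encoding
    # the garbled text recovers the original bytes exactly.
    return text.encode('latin1').decode('utf-8')

original = u'kash\u00e8r, \u1e25al\u0101l'   # "kashèr, ḥalāl"
raw_bytes = original.encode('utf-8')          # what the server would send
garbled = raw_bytes.decode('latin1')          # what a wrong decode produces

repaired = fix_mojibake(garbled)
print(repaired == original)   # True
print(garbled == original)    # False
```

With requests, another common option is to set the encoding on the response object before reading .text (for example, response.encoding = 'utf-8'), or to pass response.content (raw bytes) to BeautifulSoup and let it detect the encoding. Either way the text is decoded correctly once, instead of being repaired afterwards.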