Python Unicode characters from requests/bs4
I have a script that uses requests and bs4 to fetch song lyrics from Metrolyrics.
The problem is that when I print them, I get something like this (part of the lyrics):
Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, Kashèr, ḤalÄl, Yom Kippur, Quaresima, Ramadan
when it should look like this:
Rabbi, Papa, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, vino, kashèr, ḥalāl, Yom Kippur, Quaresima, Ramadan
The code I'm using:
import requests
from bs4 import BeautifulSoup
import os
try:
    from urllib.parse import quote_plus
except ImportError:
    from urllib import quote_plus


def get_lyrics(song_name):
    song_name += ' metrolyrics'
    name = quote_plus(song_name)
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11'
           '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    url = 'http://www.google.com/search?q=' + name
    result = requests.get(url, headers=hdr).text
    link_start = result.find('http://www.metrolyrics.com')
    if(link_start == -1):
        return("Lyrics not found on Metrolyrics")
    link_end = result.find('html', link_start + 1)
    link = result[link_start:link_end + 4]
    lyrics_html = requests.get(link, headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel'
        'Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, '
        'like Gecko) Chrome/55.0.2883.95 Safari/537.36'
    }
    ).text
    soup = BeautifulSoup(lyrics_html, "lxml")
    raw_lyrics = (soup.findAll('p', attrs={'class': 'verse'}))
    paras = []
    try:
        final_lyrics = unicode.join(u'\n', map(unicode, raw_lyrics))
    except NameError:
        final_lyrics = str.join(u'\n', map(str, raw_lyrics))
    final_lyrics = (final_lyrics.replace('<p class="verse">', '\n'))
    final_lyrics = (final_lyrics.replace('<br/>', ' '))
    final_lyrics = final_lyrics.replace('</p>', ' ')
    return (final_lyrics)
I have already tried .encode('utf-8') and .encode('unicode-escape'), and converting back again, but none of that fixed it.
I have another script that uses the musixmatch API, and there the Unicode characters are displayed correctly.
I made a small change to the get_lyrics function:
return final_lyrics.encode('latin1').decode('utf-8')
and got the desired output:
# python2
print get_lyrics('kashèr')
...
Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, Kashèr, Ḥalāl, Yom Kippur, Quaresima, Ramadan
...
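
Why this works, as far as I can tell: the page is most likely served as UTF-8, but its bytes were decoded as Latin-1 (requests falls back to ISO-8859-1 for text responses whose headers do not declare a charset). Encoding the garbled string back to Latin-1 recovers the original bytes, which can then be decoded as UTF-8. The sketch below reproduces the mix-up without any network access; fix_mojibake is a hypothetical helper name, not part of requests or bs4, and it assumes the only damage is this one wrong decode.

```python
def fix_mojibake(text):
    """Undo a 'UTF-8 bytes read as Latin-1' mix-up (hypothetical helper).

    If the text cannot be round-tripped (for example because it is
    already correct), it is returned unchanged.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = 'kashèr, ḥalāl'

# What the server sends: the text encoded as UTF-8 bytes.
raw_bytes = original.encode('utf-8')

# What a client sees if it wrongly decodes those bytes as Latin-1.
garbled = raw_bytes.decode('latin-1')

# Reversing the wrong decode restores the original text.
repaired = fix_mojibake(garbled)

print(repaired == original)
```

If the cause really is the wrong guess by requests, an alternative that avoids the round trip is to set the response's encoding attribute to 'utf-8' before reading .text, or to pass response.content (the raw bytes) to BeautifulSoup and let it detect the encoding itself.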