Python Unicode characters from requests/bs4

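I have a script that uses requests and bs4 to fetch a song's lyrics from Metrolyrics.

The problem is that when I print the lyrics, I get something like this (part of the lyrics):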

Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, Kashèr, ḤalÄl, Yom Kippur, Quaresima, Ramadan

when it should look like this:

Rabbi, Papa, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, vino, kashèr, ḥalāl, Yom Kippur, Quaresima, Ramadan

The code I use:

import requests
from bs4 import BeautifulSoup

try:
    from urllib.parse import quote_plus
except ImportError:
    from urllib import quote_plus

def get_lyrics(song_name):
    song_name += ' metrolyrics'
    name = quote_plus(song_name)
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
           '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

    url = 'http://www.google.com/search?q=' + name

    result = requests.get(url, headers=hdr).text
    link_start = result.find('http://www.metrolyrics.com')

    if(link_start == -1):
        return("Lyrics not found on Metrolyrics")

    link_end = result.find('html', link_start + 1)
    link = result[link_start:link_end + 4]


    lyrics_html = requests.get(link, headers={
                               'User-Agent': 'Mozilla/5.0 (Macintosh; Intel '
                               'Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, '
                               'like Gecko) Chrome/55.0.2883.95 Safari/537.36'
                               }
                               ).text

    soup = BeautifulSoup(lyrics_html, "lxml")
    raw_lyrics = (soup.findAll('p', attrs={'class': 'verse'}))
    try:
        final_lyrics = unicode.join(u'\n', map(unicode, raw_lyrics))
    except NameError:
        final_lyrics = str.join(u'\n', map(str, raw_lyrics))

    final_lyrics = (final_lyrics.replace('<p class="verse">', '\n'))
    final_lyrics = (final_lyrics.replace('<br/>', ' '))
    final_lyrics = final_lyrics.replace('</p>', ' ')
    return (final_lyrics)

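I have already tried .encode('utf-8'), .encode('unicode-escape'), and converting back again, but none of it solved the problem.

I have another script that uses the musixmatch API, and there the Unicode is displayed correctly.

I made a small change in the get_lyrics function: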

return final_lyrics.encode('latin1').decode('utf-8')

and got the desired output:

# python2
print get_lyrics('kashèr')
...
Rabbi, Papa, Allah, Lama, Imam, Bibbia, Dharma, Sura, Torah, Pane, Vino, Kashèr, Ḥalāl, Yom Kippur, Quaresima, Ramadan
...
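
A likely explanation for why this round trip works: when a text/* response carries no charset in its Content-Type header, requests falls back to ISO-8859-1 when it builds .text, so UTF-8 bytes come out as mojibake. Encoding back to latin1 recovers the original bytes, which can then be decoded as UTF-8. The sketch below reproduces that without any network access. It also adds a hypothetical helper, response_text (not part of requests), that sets the encoding before .text is read, under the assumption that the page really is UTF-8.

```python
# Reproduce the mojibake locally (no network needed).
original = u'kash\u00e8r, \u1e25al\u0101l'       # "kashèr, ḥalāl"
utf8_bytes = original.encode('utf-8')            # what the server actually sends
garbled = utf8_bytes.decode('latin1')            # UTF-8 bytes misread as ISO-8859-1
repaired = garbled.encode('latin1').decode('utf-8')  # the fix from the answer

print(garbled == original)    # False
print(repaired == original)   # True


def response_text(resp, assumed_encoding='utf-8'):
    """Return resp.text, overriding the encoding when the header names none.

    Hypothetical helper: `resp` is expected to behave like a
    requests.Response (headers, encoding, text). `assumed_encoding`
    is an assumption about the page, not something detected here.
    """
    content_type = resp.headers.get('Content-Type', '')
    if 'charset=' not in content_type.lower():
        # No charset declared, so do not rely on the ISO-8859-1 fallback.
        resp.encoding = assumed_encoding
    return resp.text
```

In get_lyrics this would mean replacing the second requests.get(...).text with response_text(requests.get(link, headers=...)). Another option that may work is passing the raw bytes (.content) to BeautifulSoup and letting it detect the encoding itself.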

