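How would I parse this HTML using BeautifulSoup?
I'm trying to scrape the top 100 singles chart from Acharts.co using Python and the BeautifulSoup module. So far I've managed to get the song title for a given position in the chart, but I'm a bit stuck on getting the artist names.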
import requests
from bs4 import BeautifulSoup
url = "https://acharts.co/canada_singles_top_100/2021/05"
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en,de;q=0.9,en-US;q=0.8,fr-FR;q=0.7,fr;q=0.6,es;q=0.5",
    "authority": "acharts.co",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 YaBrowser/17.6.1.749 Yowser/2.5 Safari/537.36"
}
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
for item in soup.select("td"):
    if item['class'][0] == 'cPrinciple':
        song = item.a.span.get_text()
        print(song)
This is the part of the HTML I want to parse:
<td class="cPrinciple" itemprop="item" itemscope itemtype="http://schema.org/MusicRecording">
    <a href="https://acharts.co/song/156580" itemprop="url"><span itemprop="name">Mood</span></a>
    <br />
    <span class="Sub">
        <span itemprop="byArtist" itemscope itemtype="http://schema.org/MusicGroup">
            <meta itemprop="url" content="https://acharts.co/artist/24kgoldn" />
            <span itemprop="name">24Kgoldn</span>
        </span> and
        <span itemprop="byArtist" itemscope itemtype="http://schema.org/MusicGroup">
            <meta itemprop="url" content="https://acharts.co/artist/iann_dior" />
            <span itemprop="name">Iann Dior</span>
        </span>
    </span>
So in the snippet above, how would I go about extracting "Mood" (the song title), "24Kgoldn" (artist #1), and "Iann Dior" (artist #2)? Thanks in advance.
You can do it like this:
soup = BeautifulSoup(response.text, 'html.parser')
for item in soup.select("td"):
    # .get() skips <td> cells that have no class attribute instead of raising KeyError.
    if 'cPrinciple' in item.get('class', []):
        # The artist names sit inside <span class="Sub">.
        e = item.find("span", {"class": "Sub"})
        if e is not None:
            results = e.find_all("span", {"itemprop": "name"})
            artists = [x.text for x in results]
            song = item.a.span.get_text()
            print(artists)
            print(song)
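The same extraction can also be sketched with CSS selectors. The example below runs against the HTML fragment from the question rather than the live site, so it needs no network access. Two things here are assumptions and not part of the original: the helper name `extract_songs` is hypothetical, and the fragment has been closed with `</td>` and wrapped in `<table><tr>` so that it forms a complete table.

```python
from bs4 import BeautifulSoup

# HTML fragment from the question, wrapped in <table><tr> and closed with </td>
# for this example only (assumption: the live page nests the cell the same way).
HTML = """
<table><tr>
<td class="cPrinciple" itemprop="item" itemscope itemtype="http://schema.org/MusicRecording">
<a href="https://acharts.co/song/156580" itemprop="url"><span itemprop="name">Mood</span></a>
<br />
<span class="Sub">
<span itemprop="byArtist" itemscope itemtype="http://schema.org/MusicGroup">
<meta itemprop="url" content="https://acharts.co/artist/24kgoldn" />
<span itemprop="name">24Kgoldn</span>
</span> and
<span itemprop="byArtist" itemscope itemtype="http://schema.org/MusicGroup">
<meta itemprop="url" content="https://acharts.co/artist/iann_dior" />
<span itemprop="name">Iann Dior</span>
</span>
</span>
</td>
</tr></table>
"""


def extract_songs(html):
    """Return a list of (title, [artists]) tuples, one per td.cPrinciple cell."""
    soup = BeautifulSoup(html, "html.parser")
    songs = []
    for cell in soup.select("td.cPrinciple"):
        # The title is the name span directly inside the song link.
        title_tag = cell.select_one('a > span[itemprop="name"]')
        if title_tag is None:
            continue
        # Each artist is a name span nested inside a byArtist span.
        artist_tags = cell.select('span[itemprop="byArtist"] span[itemprop="name"]')
        artists = [tag.get_text(strip=True) for tag in artist_tags]
        songs.append((title_tag.get_text(strip=True), artists))
    return songs


print(extract_songs(HTML))
```

Selecting `td.cPrinciple` directly avoids checking the class attribute by hand, and scoping the artist selector to `byArtist` keeps the song title out of the artist list.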
A more compact way (using a list comprehension):
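import requests as rq
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 YaBrowser/17.6.1.749 Yowser/2.5 Safari/537.36"}
url = "https://acharts.co/canada_singles_top_100/2021/05"
resp = rq.get(url, headers=headers)
# Name the parser explicitly; without it BeautifulSoup guesses one and emits a warning.
soup = bs(resp.content, "html.parser")
tbody = soup.find_all("tbody")[0]
# One list per table row. In the markup shown in the question, the outer byArtist
# spans contain newlines in their text, so the "\n" filter keeps only the name spans.
rows = [
    [span.text for span in row.find_all("span", attrs={"itemprop": True}) if "\n" not in span.text]
    for row in tbody.find_all("tr")
]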
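To see what the comprehension above produces, here is a minimal offline sketch. The two-row table is made up for illustration and only mimics the markup from the question ("Song B" and "Artist B" are placeholder values, not chart data), and the `charts` structure is a hypothetical way to label the result.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table that mimics the markup from the question.
HTML = """
<table><tbody>
<tr><td class="cPrinciple">
<a href="#"><span itemprop="name">Mood</span></a>
<span class="Sub">
<span itemprop="byArtist">
<span itemprop="name">24Kgoldn</span>
</span> and
<span itemprop="byArtist">
<span itemprop="name">Iann Dior</span>
</span>
</span>
</td></tr>
<tr><td class="cPrinciple">
<a href="#"><span itemprop="name">Song B</span></a>
<span class="Sub">
<span itemprop="byArtist">
<span itemprop="name">Artist B</span>
</span>
</span>
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")
tbody = soup.find("tbody")

# Same comprehension as above: the byArtist wrapper spans have text that
# contains newlines, so they are filtered out and only name spans remain.
rows = [
    [span.text for span in row.find_all("span", attrs={"itemprop": True})
     if "\n" not in span.text]
    for row in tbody.find_all("tr")
]

# The first entry of each row is the title; the rest are artists.
charts = [{"title": r[0], "artists": r[1:]} for r in rows if r]
print(charts)
```

Note that the newline filter depends on how the page formats its markup; if the site ever emitted the `byArtist` spans on a single line, the wrappers would no longer be filtered out, so selecting on `itemprop="name"` directly may be more robust.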