繁体   English   中英

无法使用 lxml xpath 获取 xml 元素值

[英]Can't get the xml element value using lxml xpath

我正在尝试抓取一个 Spotify 播放列表网页以提取艺术家和歌曲名称数据。 这是我的 python 代码:

#! /usr/bin/python
from lxml import html
import requests

playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
print("\n\nprinting variable playListPage: " + str(playlistPage))
tree = html.fromstring(playlistPage.content)
print("printing variable tree: " + str(tree))

artistList = tree.xpath("//span/a[@class='tracklist-row__artist-name-link']/text()")
print("printing variable artistList: " + str(artistList) + "\n\n")

现在最后的打印语句正在打印出一个空列表。

这是我要抓取的页面中的一些示例 HTML 。 理想情况下,我的代码会提取字符串“M83”...不确定 html 与多少相关,因此粘贴我认为必要的内容:

<div class="react-contextmenu-wrapper">
<div draggable="true">
<li class="tracklist-row" role="button" tabindex="0" data-testid="tracklist-row">
<div class="tracklist-col position-outer">
<div class="tracklist-play-pause tracklist-top-align">
<svg class="icon-play" viewBox="0 0 85 100">
<path fill="currentColor" d="M81 44.6c5 3 5 7.8 0 10.8L9 98.7c-5 3-9 .7-9-5V6.3c0-5.7 4-8 9-5l72 43.3z">
<title>
PLAY</title>
</path>
</svg>
</div>
<div class="position tracklist-top-align">
<span class="spoticon-track-16">
</span>
</div>
</div>
<div class="tracklist-col name">
<div class="track-name-wrapper tracklist-top-align">
<div class="tracklist-name ellipsis-one-line" dir="auto">
Intro</div>
<div class="second-line">
<span class="TrackListRow__artists ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__artist-name-link" href="/artist/63MQldklfxkjYDoUE4Tppz">
M83</a>
</span>
</span>
</span>
<span class="second-line-separator" aria-label="in album">
•</span>
<span class="TrackListRow__album ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__album-name-link" href="/album/6R0ynY7RF20ofs9GJR5TXR">
Hurry Up, We're Dreaming</a>
</span>
</span>
</span>
</div>
</div>
</div>
<div class="tracklist-col more">
<div class="tracklist-top-align">
<div class="react-contextmenu-wrapper">
<button class="_2221af4e93029bedeab751d04fab4b8b-scss c74a35c3aba27d72ee478f390f5d8c16-scss" type="button">
<div class="spoticon-ellipsis-16">
</div>
</button>
</div>
</div>
</div>
<div class="tracklist-col tracklist-col-duration">
<div class="tracklist-duration tracklist-top-align">
<span>
5:22</span>
</div>
</div>
</li>
</div>
</div>

使用 Beautiful Soup 的解决方案:

import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
soup = bs(page.content, 'lxml')
tracklist_container = soup.find("div", {"class": "tracklist-container"})
track_artists_container = tracklist_container.findAll("span", {"class": "artists-albums"})
artists = []
for ta in track_artists_container:
    artists.append(ta.find("span").text)
print(artists[0])

印刷

M83

此解决方案获取页面上的所有艺术家,因此您可以打印出artists列表并获得:

['M83',
 'Charles Bradley',
 'Bon Iver',
 ...
 'Death Cab for Cutie',
 'Destroyer']

您可以通过更改findAll(...) function 调用中的类名轻松地将其扩展到曲目名称和专辑。

@eNc 提供了很好的答案。 lxml解决方案:

from lxml import html
import requests

playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
tree = html.fromstring(playlistPage.content)
artistList = tree.xpath("//span[@class='artists-albums']/a[1]/span/text()")
print(artistList)

Output:

['M83', 'Charles Bradley', 'Bon Iver', 'The Middle East', 'The Antlers', 'Handsome Furs', 'Frank Turner', 'Frank Turner', 'Amy Winehouse', 'Black Lips', 'M83', 'Florence + The Machine', 'Childish Gambino', 'DJ Khaled', 'Kendrick Lamar', 'Future Islands', 'Future Islands', 'JAY-Z', 'Blood Orange', 'Cut Copy', 'Rihanna', 'Tedeschi Trucks Band', 'Bill Callahan', 'St. Vincent', 'Adele', 'Beirut', 'Childish Gambino', 'David Guetta', 'Death Cab for Cutie', 'Destroyer']

由于您无法一次获得所有结果,也许您应该切换到Selenium

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM