简体   繁体   中英

Can't get the xml element value using lxml xpath

I am trying to scrape a spotify playlist webpage to pull out artist and song name data. Here is my python code:

#! /usr/bin/python
from lxml import html
import requests

playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
print("\n\nprinting variable playListPage: " + str(playlistPage))
tree = html.fromstring(playlistPage.content)
print("printing variable tree: " + str(tree))

artistList = tree.xpath("//span/a[@class='tracklist-row__artist-name-link']/text()")
print("printing variable artistList: " + str(artistList) + "\n\n")

Right now the final print statement is printing out an empty list.

Here is some example HTML from the page I'm trying to scrape. Ideally my code would pull out the string "M83"...not sure how much html is relevant so pasting what I believe necessary:

<div class="react-contextmenu-wrapper">
<div draggable="true">
<li class="tracklist-row" role="button" tabindex="0" data-testid="tracklist-row">
<div class="tracklist-col position-outer">
<div class="tracklist-play-pause tracklist-top-align">
<svg class="icon-play" viewBox="0 0 85 100">
<path fill="currentColor" d="M81 44.6c5 3 5 7.8 0 10.8L9 98.7c-5 3-9 .7-9-5V6.3c0-5.7 4-8 9-5l72 43.3z">
<title>
PLAY</title>
</path>
</svg>
</div>
<div class="position tracklist-top-align">
<span class="spoticon-track-16">
</span>
</div>
</div>
<div class="tracklist-col name">
<div class="track-name-wrapper tracklist-top-align">
<div class="tracklist-name ellipsis-one-line" dir="auto">
Intro</div>
<div class="second-line">
<span class="TrackListRow__artists ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__artist-name-link" href="/artist/63MQldklfxkjYDoUE4Tppz">
M83</a>
</span>
</span>
</span>
<span class="second-line-separator" aria-label="in album">
•</span>
<span class="TrackListRow__album ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__album-name-link" href="/album/6R0ynY7RF20ofs9GJR5TXR">
Hurry Up, We're Dreaming</a>
</span>
</span>
</span>
</div>
</div>
</div>
<div class="tracklist-col more">
<div class="tracklist-top-align">
<div class="react-contextmenu-wrapper">
<button class="_2221af4e93029bedeab751d04fab4b8b-scss c74a35c3aba27d72ee478f390f5d8c16-scss" type="button">
<div class="spoticon-ellipsis-16">
</div>
</button>
</div>
</div>
</div>
<div class="tracklist-col tracklist-col-duration">
<div class="tracklist-duration tracklist-top-align">
<span>
5:22</span>
</div>
</div>
</li>
</div>
</div>

A solution using Beautiful Soup:

import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
soup = bs(page.content, 'lxml')
tracklist_container = soup.find("div", {"class": "tracklist-container"})
track_artists_container = tracklist_container.findAll("span", {"class": "artists-albums"})
artists = []
for ta in track_artists_container:
    artists.append(ta.find("span").text)
print(artists[0])

prints

M83

This solution gets all the artists on the page so you could print out the list artists and get:

['M83',
 'Charles Bradley',
 'Bon Iver',
 ...
 'Death Cab for Cutie',
 'Destroyer']

And you can extend this to track names and albums quite easily by changing the classname in the findAll(...) function call.

Nice answer provided by @eNc. lxml solution:

from lxml import html
import requests

playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
tree = html.fromstring(playlistPage.content)
artistList = tree.xpath("//span[@class='artists-albums']/a[1]/span/text()")
print(artistList)

Output:

['M83', 'Charles Bradley', 'Bon Iver', 'The Middle East', 'The Antlers', 'Handsome Furs', 'Frank Turner', 'Frank Turner', 'Amy Winehouse', 'Black Lips', 'M83', 'Florence + The Machine', 'Childish Gambino', 'DJ Khaled', 'Kendrick Lamar', 'Future Islands', 'Future Islands', 'JAY-Z', 'Blood Orange', 'Cut Copy', 'Rihanna', 'Tedeschi Trucks Band', 'Bill Callahan', 'St. Vincent', 'Adele', 'Beirut', 'Childish Gambino', 'David Guetta', 'Death Cab for Cutie', 'Destroyer']

Since you can't get all the results in one shot, maybe you should switch for Selenium .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM