
How to scrape the names of all the artists from the table using Selenium and Python?

I am trying to scrape a website listing the top 1000 artists and append their names to a list so that I can run a lyrical analysis by searching for each artist. The website has an option to display all 1000 artists at once, so I used Selenium to select it. I then find the artist names as a list of WebElements, iterate through the list to get each element's text, and append it to my list. The program keeps throwing a StaleElementReferenceException after obtaining a certain number of artists, as shown below.

[image: enter image description here]

I tried a number of suggested fixes, such as an explicit wait or a try/except block, but they ended up crashing the program. Most solutions I have seen apply when clicking or otherwise interacting with a web element. However, I am not changing anything on the page after I select my option, so I am not sure where I am going wrong. I am fairly new to Selenium, so I am not sure whether this is the best way to obtain the artist names. Any help would be appreciated.

My code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://chartmasters.org/most-streamed-artists-ever-on-spotify/')

try:
    # get the select tag
    select = Select(driver.find_element(By.CSS_SELECTOR, '#table_1_length > label > div > select'))
    # select by value (select All option to get all 1000 artists)
    select.select_by_value('-1')

    all_artists = []
    all_artists_references = driver.find_elements(By.CSS_SELECTOR, '.bolded.column-artist-name')

    for element in all_artists_references:
        print(element.text)
        all_artists.append(element.text)

    print(all_artists)

finally:
    driver.quit()

To extract and print all 1000 artist names, induce WebDriverWait for visibility_of_all_elements_located() and collect the text with a list comprehension. You can use either of the following locator strategies:

  • Using CSS_SELECTOR:

     print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table#table_1 tbody tr[role='row'] td:nth-of-type(2)")))])

  • Using XPATH:

     print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='table_1']//tbody//tr[@role='row']//following::td[2]")))])

  • Note: You have to add the following imports:

     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC
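
If the StaleElementReferenceException still shows up, one plausible cause (an assumption, since the page's scripts are not shown here) is that the table is still being redrawn after the "All" option is selected, so references grabbed earlier point at rows that have since been replaced. Below is a minimal sketch of a retry that re-locates the cells on every attempt instead of reusing old references. The names collect_texts and scrape_artist_names are hypothetical helpers, not part of Selenium, and the retry count and delay are arbitrary.

```python
import time


def collect_texts(find_elements, stale_exceptions, retries=3, delay=0.5):
    """Return the text of every element, re-locating the list if it goes stale.

    find_elements    -- zero-argument callable that returns freshly located elements
    stale_exceptions -- exception class (or tuple of classes) signalling a stale reference
    retries          -- maximum number of attempts (must be at least 1)
    """
    for attempt in range(retries):
        try:
            # Locate the elements and read their text in a single pass.
            return [element.text for element in find_elements()]
        except stale_exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(delay)  # give the table time to finish redrawing
    return []  # only reached when retries is 0


def scrape_artist_names(driver):
    """Hypothetical wrapper wiring collect_texts to a Selenium driver."""
    # Imported here so collect_texts itself stays usable without Selenium.
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.by import By

    # Same CSS selector as in the answer above.
    locator = (By.CSS_SELECTOR, "table#table_1 tbody tr[role='row'] td:nth-of-type(2)")
    return collect_texts(
        lambda: driver.find_elements(*locator),
        StaleElementReferenceException,
    )
```

Calling scrape_artist_names(driver) right after select.select_by_value('-1') would replace the for loop in the question. Whether three attempts are enough depends on how long the page takes to redraw.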

The form payload needed to request the exact table is rather lengthy, but it is much more efficient to get the data straight from the source than to drive a browser.

import requests
import pandas as pd

url = 'https://chartmasters.org/wp-admin/admin-ajax.php'
params = {
    'action': 'get_wdtable',
    'table_id': '1'}
data = {
'draw': '1',
'columns[0][data]': '0',
'columns[0][name]': 'rank',
'columns[0][searchable]': 'true',
'columns[0][orderable]': 'false',
'columns[0][search][value]': '',
'columns[0][search][regex]': 'false',
'columns[1][data]': '1',
'columns[1][name]': 'Artist Name',
'columns[1][searchable]': 'true',
'columns[1][orderable]': 'false',
'columns[1][search][value]': '',
'columns[1][search][regex]': 'false',
'columns[2][data]': '2',
'columns[2][name]': 'Lead Streams',
'columns[2][searchable]': 'true',
'columns[2][orderable]': 'true',
'columns[2][search][value]': '',
'columns[2][search][regex]': 'false',
'columns[3][data]': '3',
'columns[3][name]': 'Featured Streams',
'columns[3][searchable]': 'true',
'columns[3][orderable]': 'true',
'columns[3][search][value]': '',
'columns[3][search][regex]': 'false',
'columns[4][data]': '4',
'columns[4][name]': 'Tracks',
'columns[4][searchable]': 'true',
'columns[4][orderable]': 'true',
'columns[4][search][value]': '',
'columns[4][search][regex]': 'false',
'columns[5][data]': '5',
'columns[5][name]': '1b+',
'columns[5][searchable]': 'true',
'columns[5][orderable]': 'true',
'columns[5][search][value]': '',
'columns[5][search][regex]': 'false',
'columns[6][data]': '6',
'columns[6][name]': '100m+',
'columns[6][searchable]': 'true',
'columns[6][orderable]': 'true',
'columns[6][search][value]': '',
'columns[6][search][regex]': 'false',
'columns[7][data]': '7',
'columns[7][name]': '10m+',
'columns[7][searchable]': 'true',
'columns[7][orderable]': 'true',
'columns[7][search][value]': '',
'columns[7][search][regex]': 'false',
'columns[8][data]': '8',
'columns[8][name]': '1m+',
'columns[8][searchable]': 'true',
'columns[8][orderable]': 'true',
'columns[8][search][value]': '',
'columns[8][search][regex]': 'false',
'columns[9][data]': '9',
'columns[9][name]': 'Last Update',
'columns[9][searchable]': 'true',
'columns[9][orderable]': 'true',
'columns[9][search][value]': '',
'columns[9][search][regex]': 'false',
'order[0][column]': '2',
'order[0][dir]': 'desc',
'start': '0',
'length': '9999',
'search[value]': '',
'search[regex]': 'false',
'wdtNonce': '64ac23afe1'}


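# Column names are the values of the 'columns[i][name]' entries, in order.
cols = []
for k, v in data.items():
    if 'name' in k:
        cols.append(v)

# POST the form data to the AJAX endpoint and load the returned rows.
jsonData = requests.post(url, params=params, data=data).json()
df = pd.DataFrame(jsonData['data'], columns=cols)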

Output:

print(df)
     rank    Artist Name    Lead Streams  ... 10m+  1m+ Last Update
0       1          Drake  45,625,377,884  ...  241  244    29.03.22
1       2     Ed Sheeran  34,724,649,138  ...  165  199    29.03.22
2       3      Bad Bunny  33,419,082,838  ...  134  140    29.03.22
3       4     The Weeknd  30,455,269,996  ...  143  161    29.03.22
4       5  Ariana Grande  30,021,891,319  ...  126  175    29.03.22
..    ...            ...             ...  ...  ...  ...         ...
995   996          HONNE   1,229,848,408  ...   29   85    18.12.21
996   997  Darius Rucker   1,229,826,891  ...   14   77    28.03.22
997   998       King Von   1,224,925,368  ...   34   68    14.03.22
998   999        JP Saxe   1,224,510,818  ...   13   30    24.03.22
999  1000        Showtek   1,223,338,892  ...   19   69    26.02.21

[1000 rows x 10 columns]
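
The form data above repeats the same six keys for each column, so it can also be generated from a list of column names. The sketch below does that and then pulls the artist names out of the returned rows, which is what the question needs. build_payload and artist_names are hypothetical helper names. The nonce is passed in rather than hard-coded, on the assumption that it may change between page loads. The name column is assumed to sit at index 1, as in the output above.

```python
COLUMN_NAMES = [
    "rank", "Artist Name", "Lead Streams", "Featured Streams", "Tracks",
    "1b+", "100m+", "10m+", "1m+", "Last Update",
]


def build_payload(column_names, nonce, unorderable=(0, 1), length=9999):
    """Build the DataTables-style form data used in the answer above."""
    payload = {"draw": "1"}
    for i, name in enumerate(column_names):
        prefix = f"columns[{i}]"
        payload[f"{prefix}[data]"] = str(i)
        payload[f"{prefix}[name]"] = name
        payload[f"{prefix}[searchable]"] = "true"
        # In the original payload only the first two columns are not orderable.
        payload[f"{prefix}[orderable]"] = "false" if i in unorderable else "true"
        payload[f"{prefix}[search][value]"] = ""
        payload[f"{prefix}[search][regex]"] = "false"
    payload.update({
        "order[0][column]": "2",   # sort by "Lead Streams"
        "order[0][dir]": "desc",
        "start": "0",
        "length": str(length),     # larger than the table, so all rows come back
        "search[value]": "",
        "search[regex]": "false",
        "wdtNonce": nonce,
    })
    return payload


def artist_names(rows, name_index=1):
    """Return the artist-name cell of each row (a list of cell values)."""
    return [row[name_index] for row in rows]


# Usage, with url, params and jsonData as in the answer above:
#   data = build_payload(COLUMN_NAMES, nonce='64ac23afe1')
#   jsonData = requests.post(url, params=params, data=data).json()
#   all_artists = artist_names(jsonData['data'])
```

With pandas already loaded, df['Artist Name'].tolist() gives the same list directly from the DataFrame.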


 