Scraping text from a website when the text does not appear in the source

I am trying to retrieve the 'Now Playing' information from http://radioplayer.magic.co.uk/live using Python and Beautiful Soup.

I can see the text in a web browser and can copy and paste it, so I assume it is downloaded from somewhere. But when I look at the page with Beautiful Soup, I can't see the text or even where it might be coming from.

I am a beginner at this so please be gentle!

Thanks in advance for sharing your knowledge and experience.

ADDITIONAL INFORMATION: I am using Python 3 on a Raspberry Pi.

The content of the Now Playing div is loaded dynamically through an AJAX request, which is why it is not included in the page source you receive.

What you can do is imitate that AJAX request and read the response directly.

This is how you can achieve this:

import json
import requests

main_url = "http://radioplayer.magic.co.uk/live/"
# The AJAX endpoint the page calls for its 'Now Playing' data
ajax_url = "http://ps1.pubnub.com/subscribe/sub-eff4f180-d0c2-11e1-bee3-1b5222fb6268/np_4/0/14901814159272341?uuid=ef978c6c-2edf-4ff5-910a-39765d038427"
response = requests.get(ajax_url)
# The first element of the decoded JSON is the list of track entries
playing_list = json.loads(response.content)[0]
max_time = 0
playing_now_dict = {}

# Keep the entry with the most recent start_time
for playing in playing_list:
    start_time = int(playing['start_time'])
    if start_time > max_time:
        max_time = start_time
        playing_now_dict = playing
print(playing_now_dict.get('title', ''))
print(playing_now_dict.get('artist', ''))

This currently prints:

Young Hearts Run Free
Candi Staton
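
As a side note, picking the entry with the latest start_time can also be written with max(). The following is only a sketch: it assumes the response body decodes to a list whose first element is the list of entries (as the code above implies), and the sample payload and its start_time values are made up for illustration.

```python
import json


def latest_track(raw_json):
    """Return the entry with the largest start_time, or {} if there are none."""
    # Assumption: the body decodes to [entries, ...], as the answer's code implies.
    entries = json.loads(raw_json)[0]
    return max(entries, key=lambda entry: int(entry["start_time"]), default={})


# Made-up payload for illustration only; a real response may carry more fields.
sample = json.dumps([
    [
        {"start_time": "1490181000", "title": "Running In The Family", "artist": "Level 42"},
        {"start_time": "1490181300", "title": "Young Hearts Run Free", "artist": "Candi Staton"},
    ],
    "14901814159272341",
])

track = latest_track(sample)
print(track.get("title", ""))
print(track.get("artist", ""))
```

Keeping the selection in a small function makes it easy to check against a saved response before pointing it at the live URL.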

It seems like a task for Python and Selenium: http://selenium-python.readthedocs.io/ (this lets you control the browser and do whatever you can do manually, e.g. select displayed text).

(Warning: the Firefox plugin is somewhat picky about the version; the last stable version in Ubuntu works only with Firefox up to 45.)

If you want to stick to a plain HTTP library (e.g. urllib, requests) rather than a real browser, you will have to monitor the network calls made while the website loads and find the exact URI (and any necessary form data) to request from Python.
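
Once the browser's network tab has shown which request carries the data, replaying it might look roughly like the sketch below. It uses only the standard library; the endpoint, the query parameter and the header values are hypothetical placeholders, and build_request and fetch_json are illustrative helper names, not part of any library.

```python
import json
import urllib.parse
import urllib.request


def build_request(base_url, params=None, headers=None):
    """Build (but do not send) a GET request copied from the browser's network tab."""
    url = base_url
    if params:
        url = base_url + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers=headers or {})


def fetch_json(request, timeout=10):
    """Send the request and decode the response body as JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and values: substitute whatever the network tab shows.
req = build_request(
    "http://example.com/api/now-playing",
    params={"station": "magic"},
    headers={
        "User-Agent": "Mozilla/5.0",
        "Referer": "http://radioplayer.magic.co.uk/live/",
    },
)
print(req.full_url)
# data = fetch_json(req)  # uncomment once the real URL is filled in
```

Some servers only answer when headers such as Referer or User-Agent match what the browser sent, which is why the sketch leaves room for them.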

OR you could use python-selenium, which drives a real browser. Once the page has loaded, you can pass driver.page_source to BeautifulSoup for parsing.

Also, if you are lucky, the website may have an API (JSON/XML) that lets you fetch what you want without the hassle of parsing the raw source.

Selenium is usually harder to install than to use. For example, you could try the following first on a normal PC:

from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://radioplayer.magic.co.uk/live/"
# Start the locally installed Firefox
browser = webdriver.Firefox()
browser.get(url)
# page_source is the page as the browser currently sees it, scripts included
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()
playlist = soup.find(id='playlist')

print(playlist.find('span', class_='artist').text)
print(playlist.find('span', class_='title').text)

This would give you something like:

Level 42
Running In The Family

You will need to investigate which browser driver is compatible with a Raspberry Pi.
