
Python selenium get "Developer Tools" →Network→Media logs

I am trying to programmatically do something that necessarily involves getting the "developer tools"→network→media logs.

I will spare you the details, long story short, I need to visit thousands of pages like this: https://music.163.com/#/song?id=ID , where ID after the equals sign is a number.

If you open such a page, there is a play button. The button triggers JavaScript that loads a music file, which is not referenced anywhere in the page, and plays it. (Note: you may need a Chinese IP address to listen to some songs, and a VIP account to listen to others.)

For example, this page: https://music.163.com/#/song?id=32477986 should look like this:

[screenshot]

If you click the blue button, the javascript is triggered, and the music file will be loaded by javascript and be played. This music file will not be an element in the webpage and therefore can't be directly scraped by find_element* methods.

But I have found a way to find the address of the music file.

In Firefox, press F12 to bring up the inspector/"developer tools", click Network, then click Media. Click the blue button and multiple requests with the same file name will be shown. The file name will match ^[0-9a-f]+\.m4a , and the domains may differ.
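
The file-name pattern above can be checked in Python roughly like this. This is a minimal sketch: the helper name `looks_like_song_file` is made up for illustration, a trailing `$` is added to the pattern as an assumption so that nothing may follow the extension, and only the last path segment of the URL is tested.

```python
import re
from urllib.parse import urlsplit

# Pattern from the description above: lowercase hex digits followed by ".m4a".
# The trailing "$" is an assumption added here, not part of the original pattern.
SONG_FILE = re.compile(r'^[0-9a-f]+\.m4a$')

def looks_like_song_file(url):
    """Return True if the last path segment of url matches the pattern."""
    # urlsplit().path drops the query string, so "?token=..." is ignored.
    name = urlsplit(url).path.rsplit('/', 1)[-1]
    return SONG_FILE.match(name) is not None
```

Because the domain may differ between requests, comparing only the file name is what ties the requests together.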

Like this:

[screenshot]

Click any of the records and you will find its address, any of these will work, like this:

[screenshot]

And I am currently trying to figure out how to programmatically simulate this process.

I Googled this: python selenium developer tools network tab , and didn't find what I was looking for, which is exactly what I expected. I posted the link to show my research effort, and to show how Google doesn't understand the meaning of what you are trying to search for.

Anyway I stumbled upon this: https://www.rkengler.com/how-to-capture-network-traffic-when-scraping-with-selenium-and-python/

And tested with these:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
capabilities = DesiredCapabilities.CHROME
capabilities["goog:loggingPrefs"] = {'performance': "ALL"}
driver = webdriver.Chrome(desired_capabilities=capabilities)
wait = WebDriverWait(driver, 15)
driver.get('https://music.163.com/#/song?id=32477986')
iframe = driver.find_element_by_xpath('//iframe[@id="g_iframe"]')
driver.switch_to.frame(iframe)
wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
play = driver.find_element_by_xpath('//div[2]/div/a[1]')
play.click()
time.sleep(10)
driver.get_log('performance')

It worked, but the output is too broad, and I prefer using Firefox.

I then tried to find all valid loggingPrefs options using Google: chrome all "loggingPrefs" options , unfortunately but unsurprisingly I could find nothing, except for browser:ALL and driver:ALL .

And I can't find any documentation that specifies all the possible switches.

But I thought maybe I had found a pattern: performance is a tab in the inspector/devtools, and network is another tab.

So I replaced the two occurrences of 'performance' with 'network' and ran the code again:

InvalidArgumentException: Message: invalid argument: log type 'network' not found
  (Session info: chrome=89.0.4389.90)

This is what I got.

Regardless, this is what I had put together:

import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.headless = True
path = (os.environ['APPDATA'] + r'\Mozilla\Firefox\Profiles\Selenium').replace('\\', '/')
profile = webdriver.FirefoxProfile(path)
profile.set_preference("media.volume_scale", "0.0")

capabilities = DesiredCapabilities.FIREFOX
capabilities["loggingPrefs"] = {'performance': 'ALL'}

Firefox = webdriver.Firefox(firefox_profile=profile, desired_capabilities=capabilities, options=options)
wait = WebDriverWait(Firefox, 15)
Firefox.get('https://music.163.com/#/song?id=32477986')
iframe = Firefox.find_element_by_xpath('//iframe[@id="g_iframe"]')
Firefox.switch_to.frame(iframe)
wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
play = Firefox.find_element_by_xpath('//div[2]/div/a[1]')
play.click()
time.sleep(10)
Firefox.get_log('performance')

And this is how it failed:

WebDriverException: Message: HTTP method not allowed

How on earth can I get the Network→Media logs using Python selenium? I can't even make the logging preferences work. Everything I have found uses the 'loggingPrefs' key, and as you can see it doesn't work. I seem to vaguely remember gecko:loggingPrefs but I can't find anything by Googling "gecko:loggingPrefs" .

And this comment: Getting console.log output from Firefox with Selenium mentions driver.get_log('browser') will not work anymore. But it's unclear whether it applies to only browser or all the logs.

How can I get the Firefox inspector logs and how can I narrow it down to network→media tab after that?

I am really sorry if I haven't shown enough research effort, but how the hell am I going to research something online without using Google? And don't you know from your own experience that Google never understands the meaning of your search terms? It only finds documents containing the keywords, where the keywords are randomly scattered around the document, and the result doesn't even have to contain all the keywords!

Google really is a bad researching tool and I really don't have anything better than Google. So if that's not enough research effort then I don't know anything that will qualify as enough research effort.

So how can I get inspector→network→media logs in Firefox using Python 3.9.5 selenium?


And Google leads me here, and frankly the onsite search engine is even worse than Google. I can't find the answer to what I am looking for which is precisely why I asked questions here.


After some more research I have finally found something: https://stackoverflow.com/a/65538568/15290516

This answer takes me one step closer to my goal, but I don't know a thing about javascript, and the testing returns:

JavascriptException: Message: Cyclic object value

But it does point in the right direction: the solution should involve .execute_script() to get the job done, but I don't know exactly what the commands should be. I tried Googling this: javascript get "devtools" "network" "media" "logs" , see for yourself what it returns.


Hmm, I managed to get the performance log with Chrome and redirect it to a text file, I uploaded it to Google Drive .

I have found the address in the file (Notepad++ search .m4a ), but I don't know how to filter the result to the requests relevant to the music file programmatically.

I think, for now I will be stuck with Chrome and performance log.

But I really have no idea how to filter the requests to get only the relevant requests. How can that be done?

Finally I have done it, all by myself, without anybody's help.

The trick is simple, once you know what to do, it isn't so hard to achieve.

The responses are in json format, so we need the json module.

The structure of the json varies, but the first level keys are fixed, there are always three keys: level , message , timestamp .

We need the message key, its value is a json object packed in a string, so we need json.loads to unpack it.

The structure of these packed json objects varies a lot, but there is always a message key and a method key inside the message key.
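
To make the nesting concrete, here is a small sketch that unpacks one entry. The sample entry is hand-written for illustration (it is not copied from a real log), and `unpack_entry` is a hypothetical helper name.

```python
import json

def unpack_entry(entry):
    """Decode the JSON string under entry['message'] and return the inner 'message' dict."""
    outer = json.loads(entry['message'])
    return outer['message']

# Hand-written entry with the same three top-level keys as described above.
sample_entry = {
    'level': 'INFO',
    'timestamp': 0,
    'message': json.dumps({
        'message': {'method': 'Network.responseReceived', 'params': {}},
    }),
}

inner = unpack_entry(sample_entry)
print(inner['method'])  # Network.responseReceived
```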

Here we are trying to scrape the addresses of received media files, and long story short, the message → message → method key should equal 'Network.responseReceived' .

If the message → message → method key equals 'Network.responseReceived' , then there will always be a message → message → params → response → mimeType key.

That key stores the MIME type of the resource. I will spare you the details: MP4 (from the Moving Picture Experts Group's MPEG-4 standard) is better known as a video format, but here the media type should be 'audio/mp4' .

If all the above criteria are satisfied, then the address of the media file is the value of the message → message → params → response → url key.

This is the final code:

import json
import os
import random
import sys
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path = os.environ['LOCALAPPDATA'] + '\\Google\\Chrome\\User Data'

options = webdriver.ChromeOptions()
options.add_argument('--disable-gpu')
options.add_argument('--headless')
options.add_argument('--log-level=3')
options.add_argument('--mute-audio')
options.add_argument(f'--user-data-dir={path}')
# Ask ChromeDriver to record the performance log, which includes network events.
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

Chrome = webdriver.Chrome(options=options)
wait = WebDriverWait(Chrome, 5)

def getlink(addr):
    Chrome.get(addr)
    # The play button is inside an iframe, so switch into it first.
    iframe = Chrome.find_element(By.XPATH, '//iframe[@id="g_iframe"]')
    Chrome.switch_to.frame(iframe)
    wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
    play = Chrome.find_element(By.XPATH, '//div[2]/div/a[1]')
    play.click()
    # Give the player time to request the music file.
    time.sleep(5)
    logs = Chrome.get_log('performance')
    addresses = []
    for i in logs:
        # Each entry's 'message' value is a JSON object packed in a string.
        log = json.loads(i['message'])
        if log['message']['method'] == 'Network.responseReceived':
            if log['message']['params']['response']['mimeType'] == 'audio/mp4':
                addresses.append(log['message']['params']['response']['url'])
    # All collected addresses should share one file name; only the domain may differ.
    check = {i.split('/')[-1] for i in addresses}
    if len(check) == 1:
        return random.choice(addresses)
    # Otherwise nothing is returned (None).

if __name__ == '__main__':
    print(getlink(sys.argv[1]))

Great code. You would get better results if you also accepted 'audio/mpeg'.
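
Following that suggestion, the filter could accept more than one MIME type. This is only a sketch: `media_urls` and `AUDIO_TYPES` are hypothetical names, the entries are assumed to have the shape described in the answer, and `.get()` is used so that an entry missing a key is skipped instead of raising KeyError.

```python
import json

# MIME types to accept; 'audio/mpeg' is the addition suggested in the comment.
AUDIO_TYPES = frozenset({'audio/mp4', 'audio/mpeg'})

def media_urls(logs, mime_types=AUDIO_TYPES):
    """Return URLs of Network.responseReceived entries whose MIME type is in mime_types."""
    urls = []
    for entry in logs:
        # Unpack the JSON string, then take the inner 'message' object.
        message = json.loads(entry['message']).get('message', {})
        if message.get('method') != 'Network.responseReceived':
            continue
        response = message.get('params', {}).get('response', {})
        if response.get('mimeType') in mime_types and 'url' in response:
            urls.append(response['url'])
    return urls
```

With the final code above it would be called as media_urls(Chrome.get_log('performance')).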
