Python Selenium: get the Developer Tools → Network → Media log

Question

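I am trying to do something programmatically that requires reading the Developer Tools → Network → Media log.

I will spare you the details. In short, I need to visit thousands of pages like https://music.163.com/#/song?id=ID , where the ID after the equals sign is a number.

Each such page has a play button. Clicking it triggers JavaScript that loads a music file that is not referenced anywhere in the page, and plays it. (Note: some songs may require a Chinese IP address, and some others require a VIP account.)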

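Take this page as an example: https://music.163.com/#/song?id=32477986 . It should look like this: [screenshot]

Clicking the blue button triggers the JavaScript, which loads and plays the music file. The file never becomes an element of the web page, so it cannot be scraped directly with the find_element* methods.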

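However, I found a way to locate the address of the music file.

In Firefox, press F12 to bring up the Inspector/Developer Tools, click Network, then click Media. Click the blue button and several requests with the same file name appear. The file name matches ^[0-9a-f]+\.m4a ; the domains may differ.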
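The file-name pattern above can be checked in Python. This is only a minimal sketch: the helper name is_music_file and the example URLs are made up for illustration, and it uses fullmatch, which is slightly stricter than the pattern in the question because it also anchors the end of the name.

```python
import re
from urllib.parse import urlsplit

# Pattern from the question: a lowercase hexadecimal name ending in .m4a.
M4A_NAME = re.compile(r'[0-9a-f]+\.m4a')

def is_music_file(url):
    """Return True if the last path segment of url looks like <hex>.m4a."""
    # urlsplit drops the query string, so tokens after '?' do not matter.
    name = urlsplit(url).path.rsplit('/', 1)[-1]
    return M4A_NAME.fullmatch(name) is not None

# Hypothetical URLs, for illustration only.
print(is_music_file('https://m1.example.com/path/0a1b2c3d4e5f.m4a?token=abc'))
print(is_music_file('https://m1.example.com/static/logo.png'))
```

Matching on the file name rather than the host fits the observation that the same file is served from several domains.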

Like this: [screenshot]

Click any of the records and you will find its address. Any one of them will do, as shown here: [screenshot]

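I am currently trying to figure out how to simulate this process programmatically.

I googled this: python selenium developer tools network tab , and did not find what I was looking for, which is exactly what I expected. I am posting the link to show my research effort, and to show how Google fails to understand the meaning of what you are searching for.

Anyway, I stumbled upon this: https://www.rkengler.com/how-to-capture-network-traffic-when-scraping-with-selenium-and-python/

and tested with this: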

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
capabilities = DesiredCapabilities.CHROME
capabilities["goog:loggingPrefs"] = {'performance': "ALL"}
driver = webdriver.Chrome(desired_capabilities=capabilities)
wait = WebDriverWait(driver, 15)
driver.get('https://music.163.com/#/song?id=32477986')
iframe = driver.find_element_by_xpath('//iframe[@id="g_iframe"]')
driver.switch_to.frame(iframe)
wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
play = driver.find_element_by_xpath('//div[2]/div/a[1]')
play.click()
time.sleep(10)
driver.get_log('performance')

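It works, but the output is far too broad, and I would prefer to use Firefox.

Then I tried to use Google to find all the valid loggingPrefs options: chrome all "loggingPrefs" options . Unfortunately but unsurprisingly, I could not find anything except browser:ALL and driver:ALL .

I also cannot find any documentation that lists all the possible switches.

But I thought I might have spotted a pattern: Performance is one tab in the Inspector/DevTools, and Network is another.

So I replaced both occurrences of 'performance' with 'network' and ran the code again: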

InvalidArgumentException: Message: invalid argument: log type 'network' not found
  (Session info: chrome=89.0.4389.90)

That is what I got.

Anyway, this is what I put together:

import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.headless = True
# Raw string avoids invalid escape sequences such as '\M' and '\F'.
path = (os.environ['APPDATA'] + r'\Mozilla\Firefox\Profiles\Selenium').replace('\\', '/')
profile = webdriver.FirefoxProfile(path)
profile.set_preference("media.volume_scale", "0.0")

capabilities = DesiredCapabilities.FIREFOX
capabilities["loggingPrefs"] = {'performance': 'ALL'}

Firefox = webdriver.Firefox(firefox_profile=profile, desired_capabilities=capabilities, options=options)
wait = WebDriverWait(Firefox, 15)
Firefox.get('https://music.163.com/#/song?id=32477986')
iframe = Firefox.find_element_by_xpath('//iframe[@id="g_iframe"]')
Firefox.switch_to.frame(iframe)
wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
play = Firefox.find_element_by_xpath('//div[2]/div/a[1]')
play.click()
time.sleep(10)
Firefox.get_log('performance')

And this is how it fails:

WebDriverException: Message: HTTP method not allowed

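How can I get the Network → Media log using Python Selenium? I cannot even get the logging preferences to work. Everything I found uses the "loggingPrefs" key, which, as you can see, does not work. I seem to vaguely remember gecko:loggingPrefs , but googling "gecko:loggingPrefs" turns up nothing.

This comment: Getting console.log output from Firefox with Selenium mentions that driver.get_log('browser') will no longer work. But it is unclear whether that applies only to browser or to all logs.

How do I get the Firefox Inspector log, and then how do I narrow it down to the Network → Media tab?

I am truly sorry if I have not shown enough research effort, but how on earth am I supposed to do online research without Google? And do you know from your own experience that Google never understands the meaning of your search terms? It just finds documents containing the keywords, with the keywords scattered randomly around the document, and the results do not even have to contain all of the keywords!

Google is a really poor research tool, yet I have nothing better. So if this is not enough research effort, then I do not know what would count as enough.

So how do I get the Inspector → Network → Media log in Firefox using Selenium with Python 3.9.5?

Google brought me here, and frankly the on-site search engine is even worse than Google. I could not find the answer I was looking for, which is exactly why I am asking here.

After more research, I finally found something: https://stackoverflow.com/a/65538568/15290516

That answer gets me one step closer to the goal, but I know nothing about JavaScript, and my test returned: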

JavascriptException: Message: Cyclic object value

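It does point in the right direction, though: the solution should involve .execute_script() to get the job done, but I do not know what the exact command should be. I tried googling: javascript get "devtools" "network" "media" "logs" ; see for yourself what it returns.

Well, I managed to get the performance log with Chrome, redirected it to a text file, and uploaded it to Google Drive .

I found the address in the file (searching for .m4a in Notepad++), but I do not know how to programmatically filter the results down to the requests related to the music file.

I guess I am stuck with Chrome and the performance log for now.

But I really do not know how to filter the requests to get only the relevant ones. How can that be done?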

Answer 1

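In the end I did it myself, without anyone's help.

The trick is simple, and once you know what to do it is not hard to implement.

The log entries carry JSON, so we need the json module.

The structure of the JSON varies, but the first-level keys are fixed. There are always three: level , message , and timestamp .

We need the message key. Its value is a JSON object packed into a string, so we need json.loads to unpack it.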

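The structure of these packed JSON objects varies a lot, but the unpacked object always has a message key of its own, and inside that there is always a method key.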
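As a rough illustration of that nesting, here is a minimal sketch that unpacks one entry. The entry is hand-built to match the shape described above, not captured from a real session, and the URL, timestamp, and the helper name unpack are made up.

```python
import json

# A hand-built entry with the three fixed first-level keys.
# Its 'message' value is a JSON object serialized as a string.
sample_entry = {
    'level': 'INFO',
    'timestamp': 1623750000000,
    'message': json.dumps({
        'message': {
            'method': 'Network.responseReceived',
            'params': {
                'response': {
                    'mimeType': 'audio/mp4',
                    'url': 'https://m1.example.com/0a1b2c.m4a',  # hypothetical
                },
            },
        },
    }),
}

def unpack(entry):
    """Decode the JSON string under 'message' and return the inner event."""
    return json.loads(entry['message'])['message']

event = unpack(sample_entry)
print(sorted(sample_entry))  # ['level', 'message', 'timestamp']
print(event['method'])       # Network.responseReceived
```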

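Here we are trying to grab the addresses of media files that were received, so, long story short, the message → message → method key should equal 'Network.responseReceived' .

If the message → message → method key equals 'Network.responseReceived' , there is always a message → message → params → response → mimeType key.

That key stores the MIME type of the resource. I will not go into detail; I know that .mp4 comes from MPEG-4 (Moving Picture Experts Group) and is usually thought of as a video format, but the media type here should be 'audio/mp4' .

If all of these conditions are met, the address of the media file is the value of the message → message → params → response → url key.

Here is the final code: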
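The conditions above can be tried without a browser. The following is a sketch under assumptions: audio_urls and make_entry are hypothetical helper names, and the sample entries are fabricated to mimic the shape of performance log entries, so this only demonstrates the filtering logic.

```python
import json

def audio_urls(entries, mime_type='audio/mp4'):
    """Collect response URLs whose MIME type equals mime_type."""
    urls = []
    for entry in entries:
        event = json.loads(entry['message'])['message']
        # Only responseReceived events carry params -> response.
        if event.get('method') != 'Network.responseReceived':
            continue
        response = event['params']['response']
        if response.get('mimeType') == mime_type:
            urls.append(response['url'])
    return urls

def make_entry(method, params):
    """Build a fake log entry (illustration only, not real browser output)."""
    inner = {'message': {'method': method, 'params': params}}
    return {'level': 'INFO', 'timestamp': 0, 'message': json.dumps(inner)}

# Hypothetical traffic: one request event, one image, two audio responses.
sample = [
    make_entry('Network.requestWillBeSent', {'request': {'url': 'https://a.example.com/x'}}),
    make_entry('Network.responseReceived',
               {'response': {'mimeType': 'image/png', 'url': 'https://a.example.com/logo.png'}}),
    make_entry('Network.responseReceived',
               {'response': {'mimeType': 'audio/mp4', 'url': 'https://m1.example.com/0a1b.m4a'}}),
    make_entry('Network.responseReceived',
               {'response': {'mimeType': 'audio/mp4', 'url': 'https://m2.example.com/0a1b.m4a'}}),
]

print(audio_urls(sample))
```

Checking the method first matters because other event types are not guaranteed to have a response object under params.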

import json
import os
import random
import sys
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path = (os.environ['LOCALAPPDATA'] + '\\Google\\Chrome\\User Data')

options = webdriver.ChromeOptions()
options.add_argument('--disable-gpu')
options.add_argument('--headless')
options.add_argument('--log-level=3')
options.add_argument('--mute-audio')
options.add_argument(f'--user-data-dir={path}')

capabilities = DesiredCapabilities.CHROME
capabilities["goog:loggingPrefs"] = {'performance': 'ALL'}

Chrome = webdriver.Chrome(options=options, desired_capabilities=capabilities)
wait = WebDriverWait(Chrome, 5)

def getlink(addr):
    Chrome.get(addr)
    # The play button lives inside the g_iframe frame.
    iframe = Chrome.find_element_by_xpath('//iframe[@id="g_iframe"]')
    Chrome.switch_to.frame(iframe)
    wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
    play = Chrome.find_element_by_xpath('//div[2]/div/a[1]')
    play.click()
    # Give the player time to request the audio file.
    time.sleep(5)
    logs = Chrome.get_log('performance')
    addresses = []
    for entry in logs:
        # Each entry's 'message' is a JSON string; decode it first.
        log = json.loads(entry['message'])
        if log['message']['method'] == 'Network.responseReceived':
            if log['message']['params']['response']['mimeType'] == 'audio/mp4':
                addresses.append(log['message']['params']['response']['url'])
    # All candidates should share one file name; otherwise return None.
    check = {url.split('/')[-1] for url in addresses}
    if len(check) == 1:
        return random.choice(addresses)

if __name__ == '__main__':
    print(getlink(sys.argv[1]))

Answer 2

Great code. You will get better results if you also accept 'audio/mpeg'.
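One way to apply this suggestion is to test the MIME type against a set instead of a single string. This is a minimal sketch; ACCEPTED_MIME_TYPES and is_wanted_audio are hypothetical names, and the response dicts are made-up examples of the message → message → params → response object.

```python
# MIME types to accept: the original 'audio/mp4' plus the suggested 'audio/mpeg'.
ACCEPTED_MIME_TYPES = frozenset({'audio/mp4', 'audio/mpeg'})

def is_wanted_audio(response, accepted=ACCEPTED_MIME_TYPES):
    """Return True if the response dict has an accepted mimeType."""
    return response.get('mimeType') in accepted

# Hypothetical response objects, for illustration only.
print(is_wanted_audio({'mimeType': 'audio/mpeg', 'url': 'https://m1.example.com/a.mp3'}))
print(is_wanted_audio({'mimeType': 'text/html', 'url': 'https://m1.example.com/'}))
```

In the answer's loop, the inner mimeType comparison would be replaced by a call like this on log['message']['params']['response'].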
