
Python selenium get "Developer Tools" → Network → Media logs

I am trying to programmatically do something that requires getting the "Developer Tools" → Network → Media logs.

Long story short, I need to visit thousands of pages like this: https://music.163.com/#/song?id=ID , where ID after the equals sign is a number.

If you open such a page, there is a play button. The button triggers JavaScript that loads a music file not referenced anywhere in the page, and plays it. (Note: you may need a Chinese IP to listen to some songs, and a VIP account to listen to some others.)

For example, this page: https://music.163.com/#/song?id=32477986 should look like this:

[screenshot]

If you click the blue button, the JavaScript is triggered and the music file is loaded and played. This music file is not an element in the webpage and therefore can't be scraped directly by the find_element* methods.

But I have found a way to find the address of the music file.

In Firefox, press F12 to bring up the inspector ("developer tools"), click Network, then click Media. Click the blue button and multiple requests will appear with the same file name; the file name matches ^[0-9a-f]+\.m4a , and the domain may differ.
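The file-name pattern quoted above can be checked in Python with a short sketch like the one below. The helper name `looks_like_song_file` and the example.com URLs are hypothetical and used only for illustration; the regular expression is the one given in the question.

```python
import re
from urllib.parse import urlparse

# Pattern quoted in the question: a lowercase-hex name followed by ".m4a".
SONG_FILE = re.compile(r'^[0-9a-f]+\.m4a')


def looks_like_song_file(url):
    """Return True if the last path segment of `url` matches the pattern."""
    # urlparse drops the query string, so "?x=1" does not affect the file name.
    filename = urlparse(url).path.rsplit('/', 1)[-1]
    return SONG_FILE.match(filename) is not None


# Hypothetical URLs, not real addresses from the site.
print(looks_like_song_file('https://m10.example.com/a/b/0f3a9c.m4a?x=1'))
print(looks_like_song_file('https://example.com/cover.jpg'))
```

Because only the last path segment is tested, the domain can vary freely, which matches the observation that the same file is served from different hosts.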

Like this:

[screenshot]

Click any of the records and you will find its address. Any of them will work, like this:

[screenshot]

I am currently trying to figure out how to simulate this process programmatically.

I Googled this: python selenium developer tools network tab , and didn't find what I was looking for, which is exactly what I expected. I posted the link to show my research effort, and how Google doesn't understand the meaning of what you are trying to search for.

Anyway, I stumbled upon this: https://www.rkengler.com/how-to-capture-network-traffic-when-scraping-with-selenium-and-python/

And tested with this:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
capabilities = DesiredCapabilities.CHROME
capabilities["goog:loggingPrefs"] = {'performance': "ALL"}
driver = webdriver.Chrome(desired_capabilities=capabilities)
wait = WebDriverWait(driver, 15)
driver.get('https://music.163.com/#/song?id=32477986')
iframe = driver.find_element_by_xpath('//iframe[@id="g_iframe"]')
driver.switch_to.frame(iframe)
wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
play = driver.find_element_by_xpath('//div[2]/div/a[1]')
play.click()
time.sleep(10)
driver.get_log('performance')

It worked, but the output is too broad, and I prefer using Firefox.

I then tried to find all valid loggingPrefs options using Google: chrome all "loggingPrefs" options . Unfortunately but unsurprisingly, I could find nothing except browser:ALL and driver:ALL .

And I can't find any documentation that lists all the possible switches.

But I thought I might have found a pattern: Performance is a tab in the inspector/devtools, and Network is another tab.

So I replaced the two occurrences of 'performance' with 'network' and ran the code again:

InvalidArgumentException: Message: invalid argument: log type 'network' not found
  (Session info: chrome=89.0.4389.90)

This is what I got.

Regardless, this is what I had put together:

import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.headless = True
path = (os.environ['APPDATA'] + r'\Mozilla\Firefox\Profiles\Selenium').replace('\\', '/')
profile = webdriver.FirefoxProfile(path)
profile.set_preference("media.volume_scale", "0.0")

capabilities = DesiredCapabilities.FIREFOX
capabilities["loggingPrefs"] = {'performance': 'ALL'}

Firefox = webdriver.Firefox(firefox_profile=profile, desired_capabilities=capabilities, options=options)
wait = WebDriverWait(Firefox, 15)
Firefox.get('https://music.163.com/#/song?id=32477986')
iframe = Firefox.find_element_by_xpath('//iframe[@id="g_iframe"]')
Firefox.switch_to.frame(iframe)
wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
play = Firefox.find_element_by_xpath('//div[2]/div/a[1]')
play.click()
time.sleep(10)
Firefox.get_log('performance')

And this is how it failed:

WebDriverException: Message: HTTP method not allowed

How on earth can I get the Network → Media logs using Python selenium? I can't even make the logging preferences work. Everything I have found uses the 'loggingPrefs' key, and as you can see it doesn't work. I seem to vaguely remember gecko:loggingPrefs , but I can't find anything by Googling "gecko:loggingPrefs" .

A comment on "Getting console.log output from Firefox with Selenium" mentions that driver.get_log('browser') no longer works, but it's unclear whether that applies only to browser or to all the logs.

How can I get the Firefox inspector logs, and how can I narrow them down to the Network → Media tab after that?

I am really sorry if I haven't shown enough research effort, but how am I supposed to research something online without using Google? You know from your own experience that Google never understands the meaning of your search terms; it only finds documents containing the keywords, scattered randomly around the document, and the results don't even have to contain all the keywords!

Google really is a bad research tool, and I don't have anything better. So if that's not enough research effort, then I don't know what would qualify.

So how can I get the inspector → Network → Media logs in Firefox using Python 3.9.5 and selenium?


And Google leads me here, and frankly the on-site search engine is even worse than Google. I can't find the answer I am looking for, which is precisely why I am asking here.


After some more research I have finally found something: https://stackoverflow.com/a/65538568/15290516

This answer takes me one step closer to my goal, but I don't know a thing about JavaScript, and testing it returns:

JavascriptException: Message: Cyclic object value

It does point in the right direction: the solution should involve .execute_script() , but I don't know exactly what the commands should be. I tried Googling this: javascript get "devtools" "network" "media" "logs" ; see for yourself what it returns.


Hmm, I managed to get the performance log with Chrome and redirect it to a text file, which I uploaded to Google Drive .

I have found the address in the file (searching for .m4a in Notepad++), but I don't know how to programmatically filter the result down to the requests relevant to the music file.

I think for now I will be stuck with Chrome and the performance log.

But I really have no idea how to filter the requests to get only the relevant ones. How can that be done?

Finally I have done it, all by myself, without anybody's help.

The trick is simple; once you know what to do, it isn't hard to achieve.

The log entries are in JSON format, so we need the json module.

The structure of the JSON varies, but the first-level keys are fixed. There are always three keys: level , message , timestamp .

We need the message key. Its value is a JSON object packed in a string, so we need json.loads to unpack it.

The structure of these packed JSON objects varies a lot, but there is always a message key inside, and a method key inside that message key.

Here we are trying to scrape the addresses of received media files, so, long story short, the message → message → method key should equal 'Network.responseReceived' .

If the message → message → method key equals 'Network.responseReceived' , then there will always be a message → message → params → response → mimeType key.

That key stores the MIME type of the resource. Although .mp4 (MPEG-4, from the Moving Picture Experts Group) is usually thought of as a video format, here the media type should be 'audio/mp4' .

If all the above criteria are satisfied, then the address of the media file is the value of the message → message → params → response → url key.
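The key path described above can be exercised without a browser by feeding a filter function some made-up entries. The sketch below assumes the entry shape described in this answer; `audio_urls`, `make_entry`, and the example.com URLs are hypothetical names and data, and real performance-log entries carry many more fields.

```python
import json


def audio_urls(entries, mime_type='audio/mp4'):
    """Return the URLs of received responses whose MIME type is `mime_type`.

    `entries` is a list shaped like the result of get_log('performance'):
    dicts whose 'message' value is a JSON string.
    """
    urls = []
    for entry in entries:
        # Unpack the JSON string, then step into the inner 'message' object.
        event = json.loads(entry['message'])['message']
        if event.get('method') != 'Network.responseReceived':
            continue
        response = event.get('params', {}).get('response', {})
        if response.get('mimeType') == mime_type:
            urls.append(response.get('url'))
    return urls


def make_entry(method, params):
    """Build a made-up log entry with the three top-level keys (illustration only)."""
    inner = {'message': {'method': method, 'params': params}}
    return {'level': 'INFO', 'timestamp': 0, 'message': json.dumps(inner)}


sample_entries = [
    make_entry('Network.requestWillBeSent', {'request': {'url': 'https://example.com/'}}),
    make_entry('Network.responseReceived',
               {'response': {'mimeType': 'text/html', 'url': 'https://example.com/'}}),
    make_entry('Network.responseReceived',
               {'response': {'mimeType': 'audio/mp4',
                             'url': 'https://m10.example.com/0f3a9c.m4a'}}),
]

print(audio_urls(sample_entries))
```

Using .get() for the nested lookups is a defensive choice in this sketch, so an entry with an unexpected shape is skipped instead of raising KeyError.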

This is the final code:

import json
import os
import random
import sys
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path = (os.environ['LOCALAPPDATA'] + '\\Google\\Chrome\\User Data')

options = webdriver.ChromeOptions()
options.add_argument('--disable-gpu')
options.add_argument('--headless')
options.add_argument('--log-level=3')
options.add_argument('--mute-audio')
options.add_argument(f'--user-data-dir={path}')

capabilities = DesiredCapabilities.CHROME
capabilities["goog:loggingPrefs"] = {'performance': 'ALL'}

Chrome = webdriver.Chrome(options=options, desired_capabilities=capabilities)
wait = WebDriverWait(Chrome, 5)

def getlink(addr):
    Chrome.get(addr)
    # The play button is looked up inside the g_iframe frame, so switch into it first.
    iframe = Chrome.find_element_by_xpath('//iframe[@id="g_iframe"]')
    Chrome.switch_to.frame(iframe)
    wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
    play = Chrome.find_element_by_xpath('//div[2]/div/a[1]')
    play.click()
    # Give the page time to request the media file.
    time.sleep(5)
    logs = Chrome.get_log('performance')
    addresses = []
    for i in logs:
        # Each entry's 'message' value is a JSON string; unpack it.
        log = json.loads(i['message'])
        if log['message']['method'] == 'Network.responseReceived':
            if log['message']['params']['response']['mimeType'] == 'audio/mp4':
                addresses.append(log['message']['params']['response']['url'])
    # Only return a link when every captured address has the same file name.
    check = {i.split('/')[-1] for i in addresses}
    if len(check) == 1:
        return random.choice(addresses)

if __name__ == '__main__':
    print(getlink(sys.argv[1]))

Great code. You would get better results if you also accepted 'audio/mpeg'.
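One way to act on this suggestion is to test the MIME type against a set of accepted values instead of a single string. This is only a sketch: `ACCEPTED_AUDIO_TYPES` and `is_accepted_audio` are hypothetical names, and the set holds just the two types mentioned on this page.

```python
# The two MIME types mentioned in the answer and in the comment above.
ACCEPTED_AUDIO_TYPES = frozenset({'audio/mp4', 'audio/mpeg'})


def is_accepted_audio(mime_type, accepted=ACCEPTED_AUDIO_TYPES):
    """Return True if `mime_type` is one of the accepted audio types."""
    return mime_type in accepted


print(is_accepted_audio('audio/mpeg'))
print(is_accepted_audio('text/html'))
```

In the final code, the comparison against 'audio/mp4' could then be swapped for a call to such a helper.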
