
How to save all the network traffic (both request and response headers) from a website using python

I am trying to find an object that is downloaded into the browser during the loading of a website.

This is the website: https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en

I'm not very good with web technology and such.

I am trying to save the request and response headers and the actual response, using only the link to the website.

If you look at the network traffic, you can see an object jobsearch.ftl?lang=en that loads towards the end, and you can see the response and headers.

Here are screenshots of the network event log showing the request and response headers.

[Screenshot: network event log]

And the actual response.

[Screenshot: response]

These are the objects that I want to save. How can I do that?

I have tried:

import json
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chromepath = "~/chromedriver/chromedriver"

caps = DesiredCapabilities.CHROME
caps['goog:loggingPrefs'] = {'performance': 'ALL'}
driver = webdriver.Chrome(executable_path=chromepath, desired_capabilities=caps)
driver.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en')

def process_browser_log_entry(entry):
    response = json.loads(entry['message'])['message']
    return response

browser_log = driver.get_log('performance') 
events = [process_browser_log_entry(entry) for entry in browser_log]
events = [event for event in events if 'Network.response' in event['method']]

But I only get some of the headers; they look like this:


{'method': 'Network.responseReceivedExtraInfo',
  'params': {'blockedCookies': [],
   'headers': {'Cache-Control': 'private',
    'Connection': 'Keep-Alive',
    'Content-Encoding': 'gzip',
    'Content-Security-Policy': "frame-ancestors 'self'",
    'Content-Type': 'text/html;charset=UTF-8',
    'Date': 'Mon, 27 Sep 2021 18:18:10 GMT',
    'Keep-Alive': 'timeout=5, max=100',
    'P3P': 'CP="CAO PSA OUR"',
    'Server': 'Taleo Web Server 8',
    'Set-Cookie': 'locale=en; path=/careersection/; secure; HttpOnly',
    'Transfer-Encoding': 'chunked',
    'Vary': 'Accept-Encoding',
    'X-Content-Type-Options': 'nosniff',
    'X-UA-Compatible': 'IE=edge',
    'X-XSS-Protection': '1'},
   'headersText': 'HTTP/1.1 200 OK\r\nDate: Mon, 27 Sep 2021 18:18:10 GMT\r\nServer: Taleo Web Server 8\r\nCache-Control: private\r\nP3P: CP="CAO PSA OUR"\r\nContent-Encoding: gzip\r\nVary: Accept-Encoding\r\nX-Content-Type-Options: nosniff\r\nSet-Cookie: locale=en; path=/careersection/; secure; HttpOnly\r\nContent-Security-Policy: frame-ancestors \'self\'\r\nX-XSS-Protection: 1\r\nX-UA-Compatible: IE=edge\r\nKeep-Alive: timeout=5, max=100\r\nConnection: Keep-Alive\r\nTransfer-Encoding: chunked\r\nContent-Type: text/html;charset=UTF-8\r\n\r\n',
   'requestId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'resourceIPAddressSpace': 'Public'}},
 {'method': 'Network.responseReceived',
  'params': {'frameId': '1624E6F3E724CA508A6D55D556CBE198',
   'loaderId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'requestId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'response': {'connectionId': 26,

They don't contain all the information I can see in the Chrome web inspector.

I want to get the whole request and response headers as well as the actual response. Is this the correct way? Is there a better way that doesn't use Selenium and uses only requests instead?
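For reference, the performance-log approach can be pushed further: on Chromium, Selenium 4 exposes `execute_cdp_cmd()`, so after collecting the log you can ask DevTools for each response body by `requestId` via the CDP method `Network.getResponseBody`. A sketch, not tested against this particular site; `save_traffic` is a hypothetical helper name:

```python
import json

def response_events(log_entries):
    """Keep only Network.responseReceived events from a performance log."""
    events = [json.loads(e['message'])['message'] for e in log_entries]
    return [ev for ev in events if ev['method'] == 'Network.responseReceived']

def save_traffic(driver, path):
    """Collect response metadata plus bodies via CDP and dump them to JSON."""
    captured = []
    for ev in response_events(driver.get_log('performance')):
        request_id = ev['params']['requestId']
        try:
            # ask DevTools for the actual body of this response
            body = driver.execute_cdp_cmd('Network.getResponseBody',
                                          {'requestId': request_id})
        except Exception:
            body = None  # the body may already have been evicted from the buffer
        captured.append({'response': ev['params']['response'], 'body': body})
    with open(path, 'w') as f:
        json.dump(captured, f, indent=2)
```

Note that bodies are only retrievable while DevTools still holds them, so call this soon after the page loads.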

You can use the selenium-wire library if you want to use Selenium for this. However, if you're only interested in a specific API, then rather than using Selenium you can use the requests library to hit the API and then print the request and response headers.

Since you're looking for the former, the Selenium way, one way to achieve this is with the selenium-wire library. It will report every background API call/request the page makes, which you can then filter easily, either after piping the result to a text file or in the terminal itself.

Install it using pip install selenium-wire

Install webdriver-manager using pip install webdriver-manager

Install Selenium 4 using pip install selenium==4.0.0b4

Use this code:

from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

from selenium.webdriver.chrome.service import Service

svc = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=svc)
driver.maximize_window()
driver.get("https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en")
for request in driver.requests:
    if request.response:
        print(
            request.url,
            request.response.status_code,
            request.headers,
            request.response.headers,
        )

which gives a detailed output for all the requests. Copying the relevant one:

https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en 200 


Host: epco.taleo.net
Connection: keep-alive
sec-ch-ua: "Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8


Date: Tue, 28 Sep 2021 11:14:14 GMT
Server: Taleo Web Server 8
Cache-Control: private
P3P: CP="CAO PSA OUR"
Content-Encoding: gzip
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
Set-Cookie: locale=en; path=/careersection/; secure; HttpOnly
Content-Security-Policy: frame-ancestors 'self'
X-XSS-Protection: 1
X-UA-Compatible: IE=edge
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html;charset=UTF-8
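Beyond printing, selenium-wire also exposes the response body as bytes on `request.response.body`, so the whole exchange can be written to disk. A minimal sketch; `dump_request` is a hypothetical helper, and the gzip branch is an assumption based on the Content-Encoding header this site sends (selenium-wire also ships its own decode helper in `seleniumwire.utils`):

```python
import gzip
import json

def dump_request(req, path):
    """Write one selenium-wire request/response pair to a JSON file."""
    body = req.response.body
    if req.response.headers.get('Content-Encoding') == 'gzip':
        # this site gzips its bodies (see the Content-Encoding header above)
        body = gzip.decompress(body)
    record = {
        'url': req.url,
        'status': req.response.status_code,
        'request_headers': dict(req.headers),
        'response_headers': dict(req.response.headers),
        'body': body.decode('utf-8', errors='replace'),
    }
    with open(path, 'w') as f:
        json.dump(record, f, indent=2)
```

You would call it inside the `for request in driver.requests:` loop, filtering on `request.url` to pick out the jobsearch.ftl exchange.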

You can use JS in Selenium, so this will be easier:

var req = new XMLHttpRequest();
req.open("get", url_address_string);
req.send();
// when you get your data then:
req.getAllResponseHeaders();

XMLHttpRequest is async, so you need some code to consume the answer.
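One way to consume the async answer without sleeping is Selenium's `execute_async_script`, which injects a callback as the script's last argument and blocks until it is invoked. A sketch using `fetch()` instead of XHR; `FETCH_HEADERS_JS` and `fetch_headers` are illustrative names, not Selenium API, and this is untested against the live site:

```python
FETCH_HEADERS_JS = """
var url = arguments[0];
var done = arguments[arguments.length - 1];  // callback injected by Selenium
fetch(url).then(function (resp) {
    var headers = {};
    resp.headers.forEach(function (value, name) { headers[name] = value; });
    done(headers);
});
"""

def fetch_headers(driver, url):
    """Return the response headers for url as a dict, via an in-page fetch()."""
    return driver.execute_async_script(FETCH_HEADERS_JS, url)
```

Remember to raise `driver.set_script_timeout(...)` if the request can be slow.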

OK, here you go:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://stackoverflow.com")
driver.execute_script("""
var xhr = new XMLHttpRequest();

xhr.addEventListener('loadend', (ev) => {
    // assign the headers to a property on the window object so we can read it later
    window.rH = xhr.getAllResponseHeaders();
    console.log(window.rH);
})

xhr.open("get", "https://stackoverflow.com/")
xhr.send()
""")
# need to wait because the xhr request is async; this is dirty, don't do this ;)
time.sleep(5)
# and now we can extract our 'rH' property from window, with JavaScript
headers = driver.execute_script("""return window.rH""")
# <--- "accept-ranges: bytes\r\ncache-control: private\r\ncontent-encoding: gzip\r\ncontent-security-policy: upgrade-insecure-requests; ....
print(headers)
# headers is just a string, but its parts are separated with \r\n, so
# headers.split("\r\n")
# will give you a list

EDIT 2: You actually don't want HEADERS. When your browser goes to the desired URL, one of the responses creates a variable for this page: _ftl

When you open Dev tools -> Console and type "_ftl" you will see the object. Now you want to access it. But this is not that easy: _ftl is a deeply nested object, so you must pick a property of it and try to access that, like: a = driver.execute_script("return window._ftl._acts")

But accessing the data will be a hard task; _ftl is a nested object and Selenium's JS serializer can't handle it automatically.
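One workaround for the serializer is to let the page serialize for you with JSON.stringify and parse the string back in Python. A best-effort sketch; `get_page_object` is a hypothetical helper, and JSON.stringify will still throw on circular structures, so it may not work on every property of _ftl:

```python
import json

def get_page_object(driver, expression):
    """Serialize a JS expression inside the page, parse it back in Python."""
    # JSON.stringify throws on circular structures, so this is best-effort
    raw = driver.execute_script("return JSON.stringify(" + expression + ");")
    return None if raw is None else json.loads(raw)

# e.g. acts = get_page_object(driver, "window._ftl && window._ftl._acts")
```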

So, another answer:

import requests
from bs4 import BeautifulSoup

url = "https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en"

g = requests.get(url)

soup = BeautifulSoup(g.text, "html.parser")
ftl_script = soup.find_all('script')[-1]
data_you_need = ftl_script.text

But this will give you a raw string; you still have to find a way to process it.
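If plain requests turns out to be enough, note that the Response object already carries both sides of the exchange (`r.request.headers` and `r.headers`), so saving what the question asks for is a small JSON dump. A sketch with hypothetical helper names; it captures only this single exchange, not the XHR calls the page makes after load:

```python
import json

def exchange_record(r):
    """Collapse a requests Response into a JSON-serialisable dict."""
    return {
        'request_headers': dict(r.request.headers),
        'response_headers': dict(r.headers),
        'status': r.status_code,
        'body': r.text,
    }

def save_exchange(url, path):
    import requests  # imported here so the header-shaping helper stays dependency-free
    with open(path, 'w') as f:
        json.dump(exchange_record(requests.get(url)), f, indent=2)
```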
