How to get the status code in the Selenium Chrome web driver in Python

Question

I am looking for a way to get the HTTP status code in Selenium, but I can't find any code that suits my needs. My other problem is that when I enter a domain that does not exist, say https://gghgjeggeg.com, Selenium does not raise any error. Its page source looks like this:

<html><head></head><body></body></html>

How can I get the status code (for valid domains, e.g. https://twiitter.com/404errpage) and raise an error for non-existent domains in Selenium? Or is there another library like Selenium that can do this?

Answer 1

For Firefox or Chrome you can use a browser extension for this. The extension saves the status code in a cookie on the response, and we read this cookie on the Selenium side.

You can read more about browser extensions here:

Chrome: https://developer.chrome.com/extensions/getstarted

Firefox: https://developer.mozilla.org/en-US/docs/Web/Tutorials

NOTE: Unsigned add-ons work only with the Firefox Dev version. If you want to use standard Firefox, you must get your extension signed on the Firefox site.

Chrome version

//your_js_file_with_extension.js

var targetPage = "*://*/*";

function setStatusCodeDiv(e) {
    chrome.cookies.set({
        url: e.url,
        name: 'status-code',
        value: `${e.statusCode}`
    });
}

chrome.webRequest.onCompleted.addListener(
  setStatusCodeDiv,
  {urls: [targetPage], types: ["main_frame"]}
);

Manifest:

{
  "description": "Save http status code in site cookies",
  "manifest_version": 2,
  "name": "StatusCodeInCookies",
  "version": "1.0",
  "permissions": [
    "webRequest", "*://*/*", "cookies"
  ],
  "background": {
    "scripts": [ "your_js_file_with_extension.js" ]
  }
}

The Firefox version is almost the same.

//your_js_file_with_extension.js

var targetPage = "*://*/*";

function setStatusCodeDiv(e) {
  browser.cookies.set({
    url: e.url,
    name: 'status-code',
    value: `${e.statusCode}`
  });
}

browser.webRequest.onCompleted.addListener(
  setStatusCodeDiv,
  {urls: [targetPage], types: ["main_frame"]}
);

Manifest:

{
  "description": "Save http status code in site cookies",
  "manifest_version": 2,
  "name": "StatusCodeInCookies",
  "version": "1.0",
  "permissions": [
    "webRequest", "*://*/*", "cookies"
  ],

  "background": {
    "scripts": [ "your_js_file_with_extension.js" ]
  },

  "applications": {
    "gecko": {
      "id": "some_id"
    }
  }
}

Next you must build these extensions:

For Chrome you must create *.pem and *.crx files (PowerShell script):

Start-Process -FilePath "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" -ArgumentList "--pack-extension=C:\Path\to\your\js\and\manifest"

For Firefox we only need a zip file (the second argument is the path of the archive to create):

[io.compression.zipfile]::CreateFromDirectory('C:\Path\to\your\js\and\manifest', 'destination\folder')
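Since the question is about Python, the zip step can also be done from Python itself. The following is a minimal sketch, assuming the extension's `manifest.json` and script sit directly in one folder. `pack_extension` and its parameter names are hypothetical helpers, not part of Selenium or of the original answer:

```python
import shutil
from pathlib import Path


def pack_extension(source_dir, archive_base):
    """Zip the contents of source_dir and return the path of the created archive.

    archive_base is the target path WITHOUT the ".zip" suffix;
    shutil.make_archive appends the suffix itself.
    """
    source = Path(source_dir)
    if not (source / "manifest.json").is_file():
        raise FileNotFoundError(f"no manifest.json in {source}")
    # root_dir makes the files land at the top level of the archive,
    # which is where the browser expects to find manifest.json.
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(source))
```

Keep the archive outside the source folder so the zip file does not end up inside itself.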

Selenium steps

OK, once we have the extension we can add it to our Selenium app. I wrote our version in C#, but I think it is easy to rewrite in other languages (a Python version can be found here: Using Extensions with Selenium (Python)).

Load the extension with the Chrome driver:

var options = new ChromeOptions();
options.AddExtension(Path.Combine(System.Environment.CurrentDirectory,@"Selenium\BrowsersExtensions\Compiled\YOUR_CHROME_EXTENSION.crx"));
var chromeDriver = new ChromeDriver(ChromeDriverService.CreateDefaultService(), options);

Load with Firefox (you must use a profile):

var profile = new FirefoxProfile();       
profile.AddExtension(Path.Combine(System.Environment.CurrentDirectory,@"Selenium\BrowsersExtensions\Compiled\YOUR_FIREFOX_EXTENSION.zip"));
var options = new FirefoxOptions
{
    Profile = profile
};
var firefoxDriver = new FirefoxDriver(FirefoxDriverService.CreateDefaultService(), options);

OK, we are almost done. Now we need to read the status code from the cookies, which should look something like this:

webDriver.Navigate().GoToUrl("your_url");
if (webDriver.Manage() is IOptions options 
    && options.Cookies.GetCookieNamed("status-code") is Cookie cookie
    && int.TryParse(cookie.Value, out var statusCode))
{
    //we delete cookies after we read status code but this is not necessary
    options.Cookies.DeleteCookieNamed("status-code");
    return statusCode;
}
logger.Warn($"Can't get http status code from {webDriver.Url}");
return 500;
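For readers working in Python, as in the question, the same cookie lookup might be sketched as follows. It assumes the extension above is installed and writes a `status-code` cookie. `read_status_code` and `build_chrome_driver` are hypothetical helper names, and unlike the C# version this sketch returns `None` rather than 500 when no status is available:

```python
def read_status_code(driver, url, cookie_name="status-code"):
    """Navigate to url and return the HTTP status stored by the extension, or None."""
    driver.get(url)
    cookie = driver.get_cookie(cookie_name)  # dict, or None if the cookie is absent
    if cookie is None:
        return None
    # Optional cleanup, mirroring the C# version above.
    driver.delete_cookie(cookie_name)
    try:
        return int(cookie["value"])
    except (KeyError, TypeError, ValueError):
        return None


def build_chrome_driver(crx_path):
    """Start Chrome with the packed extension loaded (requires the selenium package)."""
    # Imported lazily so read_status_code has no hard dependency on selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_extension(crx_path)
    return webdriver.Chrome(options=options)
```

A call would then look like `read_status_code(build_chrome_driver("YOUR_CHROME_EXTENSION.crx"), "your_url")`.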

And that is all. I have not seen an answer like this anywhere. Hope I helped.

Answer 2

Selenium is not meant to be used to directly examine HTTP status codes. Selenium is used to interact with a website the way a user would. A typical user would not open the developer tools and observe the HTTP status code, but would look at the page content.

I have even seen pages respond with HTTP 200 OK while delivering a "resource not found" message to the user.

Even the Selenium developers have addressed this:

The browser will always represent the HTTP status code, imagine for example a 404 or a 500 error page. A simple way to “fail fast” when you encounter one of these error pages is to check the page title or content of a reliable point (eg the <h1> tag) after every page load.

Source: selenium.dev / Worst practices / HTTP response codes

If you insist on using Selenium, you are better off finding the first h1 element and looking for the text Chrome shows on its error page when a site cannot be reached:

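from selenium.webdriver.common.by import By

# find_element_by_css_selector was deprecated in Selenium 4 and later removed
h1 = driver.find_element(By.CSS_SELECTOR, 'h1')
if h1.text == "This site can’t be reached":
    print("Not found")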

That said, if you want to crawl websites, you might simply use urllib, as Tek Nath suggested in the comments:

import urllib.request
import urllib.error

try:
    with urllib.request.urlopen('http://www.safasdfsadfsadfdsf.org/') as f:
        print(f.read())
        print(f.status)
        print(f.getheader("content-length"))
except urllib.error.URLError as e:
    print(e.reason)

Since the domain does not exist, the code will run into the exception handler branch.
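The question also asks for the status code of pages on domains that do exist. With urllib, an error status such as 404 arrives as `urllib.error.HTTPError`, a subclass of `URLError` that carries the code, so the two cases can be told apart. A minimal sketch follows; `fetch_status` and its `opener` parameter are hypothetical names introduced here so the function can be exercised without network access:

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return (status_code, error_reason); exactly one of the two is None."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return e.code, None
    except urllib.error.URLError as e:
        # No HTTP response at all, e.g. the domain does not resolve.
        return None, str(e.reason)
```

`HTTPError` has to be caught before `URLError`, otherwise the more general handler would swallow it and the status code would be lost.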

See the Python documentation for details and more examples:

urllib API
HTTPResponse object interface

You might then want to use a DOM parser to turn the HTML markup into a DOM tree for easier processing. That is beyond the scope of this question, but you can get started here:

xml.dom (Python documentation)
"Python: Is there a built in package to parse html into dom" (Stackoverflow)

Answer 1
2 2020-04-06 12:07:50

Answer 2
1 Accepted 2019-12-29 15:29:22
