Parse raw HTTP headers
I have a raw HTTP string and I would like to represent the fields in an object. Is there any way to parse the individual headers from an HTTP string?
'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n
[...]'
Update: It's 2019 now, so I have rewritten this answer for Python 3, following a confused comment from a programmer trying to use the code. The original Python 2 code is now down at the bottom of the answer.
There are excellent tools in the Standard Library both for parsing RFC 822 headers and for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one giant string, even though we are splitting it across several lines for readability) that we can feed to my examples:
request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)
As @TryPyPy points out, you can use Python's email message library to parse the headers — though we should add that the resulting Message object acts like a dictionary of headers once you are done creating it:
from email.parser import BytesParser
request_line, headers_alone = request_text.split(b'\r\n', 1)
headers = BytesParser().parsebytes(headers_alone)
print(len(headers)) # -> "3"
print(headers.keys()) # -> ['Host', 'Accept-Charset', 'Accept']
print(headers['Host']) # -> "cm.bell-labs.com"
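If you also want the request line, which the email parser ignores, you can split it off and parse it by hand. A quick sketch, using a shortened copy of the sample request:

```python
# Split the request line off the raw request and parse its three parts
# by hand (a sketch; the email parser above only handles the headers).
request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'\r\n'
)
request_line, _ = request_text.split(b'\r\n', 1)
method, path, version = request_line.decode('ascii').split()
print(method, path, version)  # GET /who/ken/trust.html HTTP/1.1
```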
But this, of course, ignores the request line, or makes you parse it yourself. It turns out that there is a far better solution.
The Standard Library will parse HTTP for you, if you use its BaseHTTPRequestHandler. Though its documentation is a bit obscure — a problem with the whole suite of HTTP and URL tools in the Standard Library — all you have to do to make it parse a string is (a) wrap your string in a BytesIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing, instead of letting it try to write them back to the client (since we do not have one!).
So here is our specialization of the Standard Library class:
from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message
Again, I wish the Standard Library folks would realize that HTTP parsing should be broken out in a way that does not require us to write nine lines of code to call it properly, but what can you do? Here is how you would use this simple class:
# Using this new class is really easy!
request = HTTPRequest(request_text)
print(request.error_code) # None (check this first)
print(request.command) # "GET"
print(request.path) # "/who/ken/trust.html"
print(request.request_version) # "HTTP/1.1"
print(len(request.headers)) # 3
print(request.headers.keys()) # ['Host', 'Accept-Charset', 'Accept']
print(request.headers['host']) # "cm.bell-labs.com"
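The request.path shown above still contains any query string. As a small sketch (not part of the original answer), urllib.parse can break it down further:

```python
# Break a request path with a query string into its components
# (the path below is a hypothetical example of request.path).
from urllib.parse import urlsplit, parse_qs

path = '/search?sourceid=chrome&ie=UTF-8&q=ergterst'
parts = urlsplit(path)
print(parts.path)             # /search
print(parse_qs(parts.query))  # {'sourceid': ['chrome'], 'ie': ['UTF-8'], 'q': ['ergterst']}
```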
If there is an error during parsing, error_code will not be None:
# Parsing can result in an error code and message
request = HTTPRequest(b'GET\r\nHeader: Value\r\n\r\n')
print(request.error_code) # 400
print(request.error_message) # "Bad request syntax ('GET')"
I prefer using the Standard Library like this because I suspect that it has already encountered and resolved any of the edge cases that might bite me if I tried re-implementing an Internet specification myself with regular expressions.
Here is the original code of this answer, from when I first wrote it:
request_text = (
    'GET /who/ken/trust.html HTTP/1.1\r\n'
    'Host: cm.bell-labs.com\r\n'
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    'Accept: text/html;q=0.9,text/plain\r\n'
    '\r\n'
)
And:
# Ignore the request line and parse only the headers
from mimetools import Message
from StringIO import StringIO
request_line, headers_alone = request_text.split('\r\n', 1)
headers = Message(StringIO(headers_alone))
print len(headers) # -> "3"
print headers.keys() # -> ['accept-charset', 'host', 'accept']
print headers['Host'] # -> "cm.bell-labs.com"
And:
from BaseHTTPServer import BaseHTTPRequestHandler
from StringIO import StringIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = StringIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message
And:
# Using this new class is really easy!
request = HTTPRequest(request_text)
print request.error_code # None (check this first)
print request.command # "GET"
print request.path # "/who/ken/trust.html"
print request.request_version # "HTTP/1.1"
print len(request.headers) # 3
print request.headers.keys() # ['accept-charset', 'host', 'accept']
print request.headers['host'] # "cm.bell-labs.com"
And:
# Parsing can result in an error code and message
request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')
print request.error_code # 400
print request.error_message # "Bad request syntax ('GET')"
mimetools has been deprecated since Python 2.3 and entirely removed from Python 3 (link).
Here is how you should do it in Python 3:
import email
import io
import pprint
# […]
request_line, headers_alone = request_text.split('\r\n', 1)
message = email.message_from_file(io.StringIO(headers_alone))
headers = dict(message.items())
pprint.pprint(headers, width=160)
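If the data is bytes rather than str, email.message_from_bytes skips the decode step. A small sketch with hypothetical sample headers:

```python
import email

# Parse raw header bytes directly, without decoding them first
# (the headers below are a made-up example).
raw = b'Host: example.com\r\nAccept: text/html\r\n\r\n'
message = email.message_from_bytes(raw)
print(dict(message.items()))  # {'Host': 'example.com', 'Accept': 'text/html'}
```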
This seems to work fine if you strip the GET line:
import mimetools
from StringIO import StringIO
he = "Host: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n"
m = mimetools.Message(StringIO(he))
print m.headers
One way to parse your example and add the information from the first line to the object would be:
import mimetools
from StringIO import StringIO
he = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\n'
# Pop the first line for further processing
request, he = he.split('\r\n', 1)
# Get the headers
m = mimetools.Message(StringIO(he))
# Add request information
m.dict['method'], m.dict['path'], m.dict['http-version'] = request.split()
print m['method'], m['path'], m['http-version']
print m['Connection']
print m.headers
print m.dict
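Since mimetools is Python 2 only, a rough Python 3 equivalent of the same idea (not from the original answer) could use email.parser instead:

```python
# Python 3 version of the approach above: pop the request line, then
# hand the remaining headers to email.parser (shortened sample request).
from email.parser import Parser

he = 'GET /search?q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\n'
request, he = he.split('\r\n', 1)
m = Parser().parsestr(he)
method, path, version = request.split()
print(method, path, version)  # GET /search?q=ergterst HTTP/1.1
print(m['Connection'])        # keep-alive
```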
Using python3.7, urllib3.HTTPResponse, and http.client.parse_headers, with the curl flags explained here:
curl -i -L -X GET "http://httpbin.org/relative-redirect/3" | python -c '
import sys
from io import BytesIO
from urllib3 import HTTPResponse
from http.client import parse_headers
rawresponse = sys.stdin.read().encode("utf8")
redirects = []
while True:
    header, body = rawresponse.split(b"\r\n\r\n", 1)
    if body[:4] == b"HTTP":
        redirects.append(header)
        rawresponse = body
    else:
        break
f = BytesIO(header)
# read one line for HTTP/2 STATUSCODE MESSAGE
requestline = f.readline().split(b" ")
protocol, status = requestline[:2]
headers = parse_headers(f)
resp = HTTPResponse(body, headers=headers)
resp.status = int(status)
print("headers")
print(resp.headers)
print("redirects")
print(redirects)
'
Output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 215 100 215 0 0 435 0 --:--:-- --:--:-- --:--:-- 435
headers
HTTPHeaderDict({'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Thu, 20 Sep 2018 05:39:25 GMT', 'Content-Type': 'application/json', 'Content-Length': '215', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'})
redirects
[b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /relative-redirect/2\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur',
b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /relative-redirect/1\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur',
b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /get\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur']
Notes:
In a pythonic way
request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)
print({k: v.strip() for k, v in [line.split(":", 1)
       for line in request_text.decode().splitlines() if ":" in line]})
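One caveat worth noting: a plain dict keeps only the last value of a repeated header (e.g. several Set-Cookie lines). A sketch of a variant that collects duplicates into lists, using made-up headers:

```python
# Collect repeated headers into lists instead of overwriting them
# (the raw string below is a hypothetical example).
from collections import defaultdict

raw = 'Set-Cookie: a=1\r\nSet-Cookie: b=2\r\nHost: example.com\r\n'
headers = defaultdict(list)
for line in raw.splitlines():
    if ':' in line:
        k, v = line.split(':', 1)
        headers[k].append(v.strip())
print(dict(headers))  # {'Set-Cookie': ['a=1', 'b=2'], 'Host': ['example.com']}
```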
There is another, simpler and safer way to handle headers: more object oriented, with no need for manual parsing.
A short demo.
1. Parse them
From str, bytes, fp, dict, requests.Response, email.Message, httpx.Response, or urllib3.HTTPResponse.
from requests import get
from kiss_headers import parse_it
response = get('https://www.google.fr')
headers = parse_it(response)
headers.content_type.charset # output: ISO-8859-1
# It's the same as
headers["content-type"]["charset"] # output: ISO-8859-1
2. Build them
This
from kiss_headers import *
headers = (
Host("developer.mozilla.org")
+ UserAgent(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0"
)
+ Accept("text/html")
+ Accept("application/xhtml+xml")
+ Accept("application/xml", qualifier=0.9)
+ Accept(qualifier=0.8)
+ AcceptLanguage("en-US")
+ AcceptLanguage("en", qualifier=0.5)
+ AcceptEncoding("gzip")
+ AcceptEncoding("deflate")
+ AcceptEncoding("br")
+ Referer("https://developer.mozilla.org/testpage.html")
+ Connection(should_keep_alive=True)
+ UpgradeInsecureRequests()
+ IfModifiedSince("Mon, 18 Jul 2016 02:36:04 GMT")
+ IfNoneMatch("c561c68d0ba92bbeb8b0fff2a9199f722e3a621a")
+ CacheControl(max_age=0)
)
raw_headers = str(headers)
will become
Host: developer.mozilla.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: text/html, application/xhtml+xml, application/xml; q="0.9", */*; q="0.8"
Accept-Language: en-US, en; q="0.5"
Accept-Encoding: gzip, deflate, br
Referer: https://developer.mozilla.org/testpage.html
Connection: keep-alive
Upgrade-Insecure-Requests: 1
If-Modified-Since: Mon, 18 Jul 2016 02:36:04 GMT
If-None-Match: "c561c68d0ba92bbeb8b0fff2a9199f722e3a621a"
Cache-Control: max-age="0"
See the documentation of the kiss-headers library.
In Python 3:
from email import message_from_string

# `sock` is assumed to be an already-connected socket object
data = sock.recv(4096)
headers = message_from_string(str(data, 'ASCII').split('\r\n', 1)[1])
print(headers['Host'])
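The same idea works without a live socket; here a hard-coded request stands in for the bytes returned by recv() (a sketch with a made-up request):

```python
# Parse the headers of a raw request, skipping the request line
# (the request bytes below are a hypothetical example).
from email import message_from_string

data = b'GET / HTTP/1.1\r\nHost: example.com\r\nAccept: text/html\r\n\r\n'
headers = message_from_string(str(data, 'ASCII').split('\r\n', 1)[1])
print(headers['Host'])  # example.com
```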
Is there any way to parse the individual headers from an HTTP string?
I wrote a simple function that returns a dictionary object; I hope it helps. ^_^
Python 3
def parse_request(request):
    raw_list = request.split("\r\n")
    request = {}
    for index in range(1, len(raw_list)):
        # split on the first ":" only, so values containing ":" survive
        item = raw_list[index].split(":", 1)
        if len(item) == 2:
            request.update({item[0].lstrip(' '): item[1].lstrip(' ')})
    return request
raw_request = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n'
request = parse_request(raw_request)
print(request)
print('\n')
print(request.keys())
Output:
{'Host': 'www.google.com', 'Connection': 'keep-alive', 'Accept': 'application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13', 'Accept-Encoding': 'gzip,deflate,sdch', 'Avail-Dictionary': 'GeNLY2f-', 'Accept-Language': 'en-US,en;q=0.8'}
dict_keys(['Host', 'Connection', 'Accept', 'User-Agent', 'Accept-Encoding', 'Avail-Dictionary', 'Accept-Language'])
From this question: How to parse raw HTTP request in Python 3?
Here are some Python packages that aim to parse the HTTP protocol properly: