Regex to find the HTTP response code number

Question

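I am new to regular expressions. I have a problem where I need to extract the HTTP response code given in the sample text, but I can't work out the right regex to pass to re.findall. My code is as follows: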

import re

sample_text = [
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
    '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
    'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
    '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179',
]

def func():
    r = str(sample_text)
    regext = r"(\s\d+)(?!.*\d$)"
    content_size = re.findall(regext, r)
    print(content_size)

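The output should contain only the final number after HTTP, for example 6245, 786 and 4085. But my code above also includes the status code 200 in the output. How can I prevent that? Any help would be much appreciated. Thanks in advance.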

Answer 1

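With r=str(sample_text) you are creating a single string, and that string now ends with ']

There is then only one end of string for $ to match, and you get multiple matches because the lookahead assertion holds at more positions.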
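To make this concrete, here is a minimal sketch using two shortened, made-up log lines (the host names and paths are placeholders, not the question's data). It shows that the string built by str() ends with a quote and a bracket, so the negative lookahead never rejects anything and every whitespace-plus-digits run is returned.

```python
import re

# Two shortened, hypothetical log lines standing in for the question's sample_text.
sample_text = [
    'host-a - - [01/Jul/1995:00:00:01 -0400] "GET /a HTTP/1.0" 200 6245',
    'host-b - - [01/Jul/1995:00:00:06 -0400] "GET /b HTTP/1.0" 304 0',
]

# str() on a list produces its repr: one long string that ends with a quote and "]".
r = str(sample_text)
print(r[-2:])

# Because the string never ends in a digit, (?!.*\d$) succeeds everywhere,
# so the status codes are returned along with the sizes.
matches = re.findall(r"(\s\d+)(?!.*\d$)", r)
print(matches)
```

Both the status codes and the sizes come back, each with its leading space, which is the behaviour described in the question.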

For example, you can join the entries with newlines, use a capture group (which is what re.findall returns), and pass re.M for multiline matching.

\bHTTP/\d\.\d"\s\d+\s(\d+)$

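The pattern matches:

\bHTTP/ Match HTTP/ preceded by a word boundary
\d\.\d"\s\d+\s Match a digit, a dot, a digit, a double quote, a whitespace char, 1+ digits and a whitespace char
(\d+) Capture 1+ digits in group 1
$ End of string (with re.M, the end of each line)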
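The role of re.M can be seen in a small sketch. The two-line text below is a shortened, made-up stand-in for the joined log entries; only the pattern is taken from the answer.

```python
import re

# Hypothetical two-line input, shortened from the log format in the question.
text = 'a "GET /x HTTP/1.0" 200 6245\nb "GET /y HTTP/1.0" 304 0'
pattern = r'\bHTTP/\d\.\d"\s\d+\s(\d+)$'

# Without re.M, $ only matches at the very end of the whole string.
without_flag = re.findall(pattern, text)

# With re.M, $ also matches just before each newline.
with_flag = re.findall(pattern, text, re.M)

print(without_flag)
print(with_flag)
```

Without the flag only the last line can match, which is why the answer passes re.M after joining the entries with newlines.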


import re

sample_text = ['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
               'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
               '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
               'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
               '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179']

def func():
    # One entry per line, so that $ can match at the end of every entry
    r = "\n".join(sample_text)
    regext = r'\bHTTP/\d\.\d"\s\d+\s(\d+)$'
    # re.findall returns the contents of capture group 1 for each match
    content_size = re.findall(regext, r, re.M)
    print(content_size)

func()

Output

['6245', '3985', '4085', '0', '4179']

Or use a list comprehension:

def func():
    return [m.group(1) for m in (re.search(r'\bHTTP/\d\.\d"\s\d+\s(\d+)$', s) for s in sample_text) if m]

Answer 2

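You can use this pattern: (?<=HTTP\/[12]\.0\"\s)\d+\s(\d+)

Explanation

(?<= is called a positive lookbehind. It looks backward and checks whether the pattern inside it sits immediately before the current position. If it does, matching continues (note: it only checks; the pattern inside it is not part of the match)
HTTP\/ matches HTTP/ exactly
[12] matches one of these two digits (2 was added for HTTP 2)
\.0\" matches .0"
\s matches any whitespace
\d+ one or more digits (this part matches the HTTP status code)
\s whitespace
(\d+) one or more digits, captured in a group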
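As a quick illustration of the "checks but does not match" point, the sketch below applies the same pattern to the first log line from the question and compares the whole match with the capture group.

```python
import re

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
pattern = r"(?<=HTTP\/[12]\.0\"\s)\d+\s(\d+)"

m = re.search(pattern, line)

# group(0) is the whole match: it starts after the lookbehind text,
# so it holds the status code and the size but not 'HTTP/1.0" '.
whole = m.group(0)

# group(1) is only the captured size.
size = m.group(1)

print(whole)
print(size)
```

re.findall returns only the capture group when a pattern has exactly one, which is why the status code does not appear in the answer's output.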


import re

# Raw string, so the backslashes reach the regex engine unchanged
pattern = r"(?<=HTTP\/[12]\.0\"\s)\d+\s(\d+)"

text = """
'199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
'199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
'199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179'
"""

print(re.findall(pattern, text))

Output:
['6245', '3985', '4085', '0', '4179']

Answer 3

Just split the string and take the last element

sample_text = ['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
               'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985']

values = [entry.split()[-1] for entry in sample_text]
print(values)

Output

['6245', '3985']
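If the status code is wanted alongside the size, the same splitting idea can be extended. This is only a sketch, and it assumes every entry ends with exactly two whitespace-separated fields (status, then size), as in the sample lines.

```python
entry = 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'

# Split from the right at most twice: everything before the last two
# fields stays together in the first element.
rest, status, size = entry.rsplit(maxsplit=2)

print(status, size)
```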

Answer 4

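Think carefully about what you want.

Currently, your regex returns any number preceded by a whitespace character (\s\d+) that is not followed by the sequence in the second group, (?!.*\d$), which is true for all of them.

You want to write this instead: \s(\d+)\n

\s : matches a whitespace character

(\d+) : matches the digits and returns them

\n : makes sure the line ends here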
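A minimal sketch of how this pattern might be applied. Joining the entries with newlines and appending a trailing newline are assumptions added here, not part of the answer; without the trailing newline the last entry has no \n after it and is skipped.

```python
import re

sample_text = [
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
]

joined = "\n".join(sample_text)

# With a trailing newline every entry is followed by \n, so all sizes match.
with_trailing = re.findall(r"\s(\d+)\n", joined + "\n")

# Without it, the final entry is not followed by \n and is missed.
without_trailing = re.findall(r"\s(\d+)\n", joined)

print(with_trailing)
print(without_trailing)
```

Using $ together with re.M, as in Answer 1, avoids the need for the trailing newline.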

Answer 5

I have a follow-up question. How can I extract only the IP or machine name that generated the code? For example: 199.72.81.55/unicomp6.unicomp.net
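One possible approach, sketched under the assumption that the host is always the first whitespace-delimited field of each entry (as it is in the sample lines): take the first element of a split, or match a leading run of non-whitespace characters.

```python
import re

sample_text = [
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
]

# Plain string splitting: the host is the first field.
hosts = [entry.split()[0] for entry in sample_text]

# Regex equivalent: re.match anchors at the start, \S+ takes everything up to the first space.
hosts_re = [re.match(r"\S+", entry).group(0) for entry in sample_text]

print(hosts)
```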

Answer 1
1 2021-09-03 12:16:55

Answer 2
0 2021-09-03 12:04:32

Answer 3
0 2021-09-03 12:08:02

Answer 4
0 2021-09-03 12:09:56

Answer 5
0 2022-01-11 04:35:37
