
Python regex substitution using a dictionary to clean up domain names

In the output, I need to replace each bracketed number, such as (7), with a period '.', and also remove the bracketed numbers at the beginning and end of the domain.

Can we use re.sub for this, and if so, how?

Code:

import re

log = ["4/19/2020 11:59:09 PM 2604 PACKET  0000014DE1921330 UDP Rcv 192.168.1.28   f975   Q [0001   D   NOERROR] A      (7)pagead2(17)googlesyndication(3)com(0)",
       "4/19/2020 11:59:09 PM 0574 PACKET  0000014DE18C4720 UDP R cv 192.168.2.54    9c63   Q [0001   D   NOERROR] A      (2)pg(3)cdn(5)viber(3)com(0)"]

rx_dict = { 'query': re.compile(r'(?P<query>[\S]*)$') }

for item in log:
    for key, r_exp in rx_dict.items():
        print(f"{r_exp.search(item).group(1)}")

Output:

(7)pagead2(17)googlesyndication(3)com(0)
(2)pg(3)cdn(5)viber(3)com(0)

Preferred output:

pagead2.googlesyndication.com
pg.cdn.viber.com

Pragmatic Python usage:

log = ["4/19/2020 11:59:09 PM 2604 PACKET  0000014DE1921330 UDP Rcv 192.168.1.28   f975   Q [0001   D   NOERROR] A      (7)pagead2(17)googlesyndication(3)com(0)",
       "4/19/2020 11:59:09 PM 0574 PACKET  0000014DE18C4720 UDP R cv 192.168.2.54    9c63   Q [0001   D   NOERROR] A      (2)pg(3)cdn(5)viber(3)com(0)"]

import re

urls = [re.sub(r'\(\d+\)','.',t.split()[-1]).strip('.') for t in log]

print(urls)

Output:

['pagead2.googlesyndication.com', 'pg.cdn.viber.com']
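
The one-liner above packs three steps into a single expression. A minimal sketch that spells them out one at a time may make it easier to follow. The function name `clean_domain` and the constant `LABEL_LEN` are made up here for illustration, and the sample line is copied from the question's log.

```python
import re

# Matches one length marker such as "(7)" or "(17)".
LABEL_LEN = re.compile(r"\(\d+\)")

def clean_domain(line):
    """Return the dotted domain name from one log line (illustrative helper)."""
    raw = line.split()[-1]             # last whitespace-separated field
    dotted = LABEL_LEN.sub(".", raw)   # "(2)pg(3)cdn..." becomes ".pg.cdn..."
    return dotted.strip(".")           # drop the leading and trailing dots

line = ("4/19/2020 11:59:09 PM 0574 PACKET  0000014DE18C4720 UDP R cv "
        "192.168.2.54    9c63   Q [0001   D   NOERROR] A      "
        "(2)pg(3)cdn(5)viber(3)com(0)")

print(clean_domain(line))  # pg.cdn.viber.com
```

This relies on the domain always being the last field of the line, which holds for the two sample lines but is an assumption about the log format in general.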

Dictionary refinement via rules:

If you want to apply consecutive rules via a dictionary, use lambdas throughout. The rules run in the dictionary's insertion order:

import re 

rules = {"r0": lambda x: x.split()[-1],
         "r1": lambda x: re.sub(r'\(\d+\)','.',x),
         "r2": lambda x: x.strip(".")}

result = []
for value in log:  
    result.append(value)
    for r in rules:
        result[-1] = rules[r](result[-1])

print(result)

Output:

['pagead2.googlesyndication.com', 'pg.cdn.viber.com']
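
Since the rules are applied strictly in sequence, one possible variation is to keep them in a list and fold over it with `functools.reduce`, so the order is explicit rather than implied by the dictionary. This is only a sketch: the names `RULES` and `apply_rules` are hypothetical, and the log lines are shortened versions of the ones in the question.

```python
import re
from functools import reduce

# Rules run top to bottom; each takes a string and returns a string.
RULES = [
    lambda s: s.split()[-1],                # keep the last field
    lambda s: re.sub(r"\(\d+\)", ".", s),   # "(n)" markers become dots
    lambda s: s.strip("."),                 # trim the outer dots
]

def apply_rules(text, rules=RULES):
    """Feed text through every rule in order and return the final string."""
    return reduce(lambda acc, rule: rule(acc), rules, text)

# Shortened sample lines (assumption: only the last field matters here).
log = ["Q [0001   D   NOERROR] A      (7)pagead2(17)googlesyndication(3)com(0)",
       "Q [0001   D   NOERROR] A      (2)pg(3)cdn(5)viber(3)com(0)"]

cleaned = [apply_rules(entry) for entry in log]
print(cleaned)  # ['pagead2.googlesyndication.com', 'pg.cdn.viber.com']
```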

Yes, you can use re.sub. I assume you have this dictionary so that you can extract multiple pieces from the log. You can do something like this for dispatch:

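ops = {
    # Take the last run of non-whitespace characters, turn each "(n)"
    # into ".", then strip the leading and trailing dots.
    "query": lambda e: re.sub(
        r"\(\d+\)", ".",
        re.search(r"(?P<query>\S*)$", e).group("query"),
    ).strip("."),
    # ... (further extractors elided in the original answer)
}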

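And then apply the functions to all log entries:

# One result dict per log line; a single dict keyed only on op_name
# would keep just the value from the last line.
log_results = [{op_name: op(line) for op_name, op in ops.items()} for line in log]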
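
As a rough end-to-end sketch of this dispatch idea, the version below uses named functions instead of lambdas and builds one result dictionary per log line. The `client` extractor, its IPv4 pattern, and the helper name `parse_line` are assumptions added for illustration; only the `query` extractor comes from the question.

```python
import re

def extract_query(line):
    """Domain from the last field, with "(n)" length markers turned into dots."""
    raw = re.search(r"(?P<query>\S*)$", line).group("query")
    return re.sub(r"\(\d+\)", ".", raw).strip(".")

def extract_client(line):
    """First IPv4-looking token in the line (hypothetical extra field)."""
    match = re.search(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", line)
    return match.group(0) if match else None

# Dispatch table: field name -> function that pulls that field from a line.
ops = {"query": extract_query, "client": extract_client}

def parse_line(line):
    """Run every extractor on one line and collect the results by name."""
    return {name: op(line) for name, op in ops.items()}

log = ["4/19/2020 11:59:09 PM 2604 PACKET  0000014DE1921330 UDP Rcv 192.168.1.28   f975   Q [0001   D   NOERROR] A      (7)pagead2(17)googlesyndication(3)com(0)",
       "4/19/2020 11:59:09 PM 0574 PACKET  0000014DE18C4720 UDP R cv 192.168.2.54    9c63   Q [0001   D   NOERROR] A      (2)pg(3)cdn(5)viber(3)com(0)"]

results = [parse_line(line) for line in log]
for row in results:
    print(row)
```

Named functions are a little longer than lambdas, but each extractor can then be documented and tested on its own.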


 