
python3: extract IP address from compiled pattern

I want to process every line in my log file and extract the IP address if the line matches my pattern. There are several different types of messages; in the example below I am using p1 and p2.

I could read the file line by line and match each line against each pattern. But since there can be many more patterns, I would like to do it as efficiently as possible. I was hoping to compile those patterns into one object and do the match only once for each line:

import re
import sys

IP = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

p1 = 'Registration from ' + IP + ' - Wrong password'
p2 = 'Call from ' + IP + ' rejected because extension not found'

c = re.compile(r'(?:' + p1 + '|' + p2 + ')')

for line in sys.stdin:
    match = re.search(c, line)
    if match:
        print(match['ip'])

But the above code does not work: re.compile complains that the group name ip is used twice.
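For reference, the failure can be reproduced in isolation. The sketch below uses only the standard `re` module and shortened placeholder patterns (the prefixes `A ` and `B ` are stand-ins, not the real log messages, and `try_compile` is a hypothetical helper). It shows that `re.compile` rejects a pattern in which the same group name appears twice, while a single use of the name compiles fine.

```python
import re

IP = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'


def try_compile(pattern):
    """Return the compiled pattern, or the re.error raised while compiling."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        return exc


# 'A ' and 'B ' are placeholder prefixes used for illustration only.
result = try_compile('(?:A ' + IP + '|B ' + IP + ')')
print(type(result).__name__, result)
```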

What is the most elegant way to achieve my goal?

EDIT:

I have modified my code based on the answer from @Dev Khadka.

But I am still struggling with how to properly handle the multiple ip matches. The code below prints all IPs that matched p1:

for line in sys.stdin:
    match = c.search(line)
    if match:
        print(match['ip1'])

But some lines don't match p1; they match p2. That is, I get:

1.2.3.4
None
2.3.4.5
...

How do I print the matching IP when I don't know whether it was p1, p2, ...? All I want is the IP. I don't care which pattern it matched.

You can consider installing the third-party regex module, which supports many advanced regex features, including branch reset groups, which are designed to solve exactly the problem you outlined in this question. Branch reset groups are denoted by (?|...). Capture groups at the same position or with the same name in different alternatives of a branch reset group share a single capture group in the output.

Notice that in the example below, whichever alternative matches, its capture is available under the one named group ip, so you don't need to iterate over multiple groups searching for a non-empty one:

import sys
import regex

ip_pattern = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]
pattern = regex.compile('(?|%s)' % '|'.join(patterns).format(ip=ip_pattern))
for line in sys.stdin:
    match = regex.search(pattern, line)
    if match:
        print(match['ip'])

Demo: https://repl.it/@blhsing/RegularEmbellishedBugs

Why don't you check which regex matched?

# a group that did not take part in the match is None
if match['ip1'] is not None:
    print(match['ip1'])
if match['ip2'] is not None:
    print(match['ip2'])

or something like:

names = ['ip1', 'ip2', 'ip3']
groups = match.groupdict()   # maps each group name to its text, or None
for n in names:
    if groups.get(n) is not None:
        print(groups[n])

or even

num = 1000   # upper bound on the number of patterns
groups = match.groupdict()
for i in range(1, num + 1):
    name = 'ip%d' % i
    if groups.get(name) is not None:
        print(groups[name])
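The same idea can be written without listing the group names by hand. A minimal sketch, assuming the numbered names `ip1`, `ip2`, ... from the question's edit and invented sample lines (`messages` and `first_ip` are hypothetical names): the combined pattern is generated from a list of message templates, and `match.groupdict()` maps every named group to its text, or to `None` for alternatives that did not take part in the match, so the first non-`None` value is the IP.

```python
import re

IP = r'(?P<ip%d>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

# Message templates; {ip} marks where the numbered IP group is inserted.
messages = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found',
]

# Each alternative gets its own group name: ip1, ip2, ...
combined = '|'.join(
    msg.format(ip=IP % i) for i, msg in enumerate(messages, start=1)
)
c = re.compile(combined)


def first_ip(line):
    """Return the IP captured by whichever alternative matched, else None."""
    match = c.search(line)
    if match is None:
        return None
    # Groups belonging to alternatives that did not match are None.
    return next((v for v in match.groupdict().values() if v is not None), None)


# Sample lines are invented for illustration.
print(first_ip('Registration from 1.2.3.4 - Wrong password'))
print(first_ip('Call from 2.3.4.5 rejected because extension not found'))
print(first_ip('unrelated line'))
```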

That's because you are using the same group name for two groups.

Try this; it gives the groups the names ip1 and ip2:

import re

IP = r'(?P<ip%d>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

p1 = 'Registration from ' + IP % 1 + ' - Wrong password'
p2 = 'Call from ' + IP % 2 + ' rejected because extension not found'

c = re.compile(r'(?:' + p1 + '|' + p2 + ')')
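To read back the IP without caring which pattern matched, one option with these numbered names is `match.lastgroup`, which holds the name of the last capture group that matched. This is a hedged sketch, not part of the original answer: it assumes each alternative contains exactly one capture group, writes the patterns with spaces around the IP, and uses invented log lines. `matched_ip` is a hypothetical helper name.

```python
import re

IP = r'(?P<ip%d>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

p1 = 'Registration from ' + IP % 1 + ' - Wrong password'
p2 = 'Call from ' + IP % 2 + ' rejected because extension not found'

c = re.compile(r'(?:' + p1 + '|' + p2 + ')')


def matched_ip(line):
    """Return (group_name, ip) for the alternative that matched, else None."""
    match = c.search(line)
    if match is None:
        return None
    # With one capture group per alternative, lastgroup names the group
    # belonging to the alternative that actually matched.
    name = match.lastgroup
    return name, match[name]


# Sample lines are invented for illustration.
for line in (
    'Registration from 1.2.3.4 - Wrong password',
    'Call from 2.3.4.5 rejected because extension not found',
):
    print(matched_ip(line))
```

If a pattern ever needs more than one capture group, `lastgroup` no longer identifies the IP reliably, and scanning `match.groupdict()` for the non-empty value is the safer choice.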

In the re module, named capture groups must have distinct names. Since all of your capture groups are meant to capture the same thing, it's better not to name them in this case. Use regular capture groups instead and iterate through the groups of the match object to print the first one that is not empty:

import re
import sys

ip_pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]
pattern = re.compile('|'.join(patterns).format(ip=ip_pattern))
for line in sys.stdin:
    match = re.search(pattern, line)
    if match:
        print(next(filter(None, match.groups())))

Demo: https://repl.it/@blhsing/UnevenCheerfulLight

This adds an IP address validity check to the already accepted answer. Although the standard ipaddress or socket modules would be the ideal way to do this, the code below parses the dotted address by hand:

import regex as re
from io import StringIO


def valid_ip(address):
    """Return True if address is four dot-separated integers in 0..255."""
    try:
        octets = [int(part) for part in address.split('.')]
    except ValueError:
        return False
    return len(octets) == 4 and all(0 <= octet <= 255 for octet in octets)


ip_pattern = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

patterns = [
    'Registration from {ip} - Wrong password',
    'Call from {ip} rejected because extension not found'
]

# Sample input standing in for the log file
log_file = StringIO('''
Registration from 259.1.1.1 - Wrong password,
    Call from 1.1.2.2 rejected because extension not found
''')

pattern = re.compile('(?|%s)' % '|'.join(patterns).format(ip=ip_pattern))

for line in log_file:
    match = pattern.search(line)
    if match:
        ip = match['ip']
        if valid_ip(ip):
            print(ip)
        else:
            print(f'{ip} is invalid IP')

Output:

259.1.1.1 is invalid IP
1.1.2.2
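For comparison, the standard-library route mentioned above might look like the following. A minimal sketch, with a hypothetical helper name `valid_ip_stdlib`: `ipaddress.IPv4Address` raises a `ValueError` subclass for strings that are not valid dotted-quad IPv4 addresses, so the check reduces to a try/except.

```python
import ipaddress


def valid_ip_stdlib(address):
    """Return True if address parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(address)
    except ValueError:
        # AddressValueError, raised for malformed input, subclasses ValueError.
        return False
    return True


# The same two addresses as in the sample input above.
for candidate in ('259.1.1.1', '1.1.2.2'):
    print(candidate, valid_ip_stdlib(candidate))
```

Note that handling of octets with leading zeros (such as `01.1.1.1`) has changed between Python versions, so results for such input may differ from the hand-written check.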

