
How to extract an IP address from an HTML string?

I want to extract an IP address from a string (actually a one-line HTML document) using Python.

>>> s = "<html><head><title>Current IP Check</title></head><body>Current IP Address: 165.91.15.131</body></html>"

-- '165.91.15.131' is what I want!

I tried using regular expressions, but so far I can only get the first number.

>>> import re
>>> ip = re.findall( r'([0-9]+)(?:\.[0-9]+){3}', s )
>>> ip
['165']

But I don't have a firm grasp of regular expressions; the above code was found elsewhere on the web and modified.

Remove your capturing group:

ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', s )

Result:

['165.91.15.131']
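
The difference comes from how re.findall treats groups: when the pattern contains a capturing group, it returns the group's contents rather than the whole match. The short sketch below illustrates this with the question's string; the variable names are only illustrative, and the re.finditer variant is an alternative that keeps the original group in place, not something the answer proposed.

```python
import re

s = ("<html><head><title>Current IP Check</title></head>"
     "<body>Current IP Address: 165.91.15.131</body></html>")

# With one capturing group, findall returns only what the group captured.
with_group = re.findall(r'([0-9]+)(?:\.[0-9]+){3}', s)

# With no capturing group, findall returns the whole match.
without_group = re.findall(r'[0-9]+(?:\.[0-9]+){3}', s)

# Alternative: keep the group and ask each match object for group(0),
# which is always the full matched text.
via_finditer = [m.group(0) for m in re.finditer(r'([0-9]+)(?:\.[0-9]+){3}', s)]

print(with_group, without_group, via_finditer)
```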

Notes:

  • If you are parsing HTML it might be a good idea to look at BeautifulSoup.
  • Your regular expression matches some invalid IP addresses such as 0.00.999.9999. This isn't necessarily a problem, but you should be aware of it and possibly handle this situation. You could change the + to {1,3} for a partial fix without making the regular expression overly complex.
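
One way to handle the invalid matches mentioned in the second note is to keep the regular expression simple and check the octet ranges in plain Python afterwards. The following is a minimal sketch under that assumption: the helper name extract_ipv4 is hypothetical, and the (?<!\d) and (?!\d) lookarounds are an addition here so that a match cannot start or end in the middle of a longer run of digits.

```python
import re

# Simple candidate pattern with + narrowed to {1,3}, as the note suggests.
# The lookarounds (an assumption of this sketch) stop partial-number matches.
CANDIDATE = re.compile(r'(?<!\d)[0-9]{1,3}(?:\.[0-9]{1,3}){3}(?!\d)')

def extract_ipv4(text):
    """Return the dotted quads in text whose four octets are all 0..255."""
    found = []
    for candidate in CANDIDATE.findall(text):
        # The regex only guarantees 1-3 digits per octet; check the range here.
        if all(int(octet) <= 255 for octet in candidate.split('.')):
            found.append(candidate)
    return found

s = ("<html><head><title>Current IP Check</title></head>"
     "<body>Current IP Address: 165.91.15.131</body></html>")
print(extract_ipv4(s))
```

This accepts octets with leading zeros such as 00; whether those should count as valid depends on the application.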

You can use the following regex to capture only valid IP addresses (the octet alternatives are wrapped in non-capturing groups so that | does not split the whole pattern):

re.findall(r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', s)

returns

['165.91.15.131']
import re

ipPattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

findIP = re.findall(ipPattern,s)

findIP contains ['165.91.15.131']

A quick way to find the IP address in a log line, as long as the line contains no other digits, dots, or hyphens:

 s = "<html><head><title>Current IP Check</title></head><body>Current IP Address: 165.91.15.131</body></html>"
 info = re.findall(r'[\d.-]+', s)

In [42]: info

Out[42]: ['165.91.15.131']

You can use the following regex to extract valid IPs while avoiding these common errors:
1. Some patterns accept 123.456.789.111 as a valid IP.
2. Some patterns fail to detect 127.0.00.1 as a valid IP.
3. Some patterns fail to detect IPs that start with zero, such as 08.8.8.8.

So here I post a regex that handles all of the above cases.

Note: I have extracted more than 2 million IPs with the following regex without any problem.

(?:(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)\.){3}(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)
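
The three cases listed above can be checked directly. The sketch below builds the same pattern from its repeated octet part and tests whole strings with re.fullmatch; the names OCTET, DOTTED_QUAD, BOUNDED, and is_valid are illustrative. Because the pattern carries no boundaries, searching free text can return a fragment of a longer number, so the BOUNDED variant adds (?<!\d) and (?!\d) lookarounds as an assumption of this sketch, not part of the answer.

```python
import re

# Octet alternation copied from the regex above.
OCTET = r'(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)'
DOTTED_QUAD = r'(?:' + OCTET + r'\.){3}' + OCTET

def is_valid(candidate):
    """True if the whole string is a dotted quad accepted by the pattern."""
    return re.fullmatch(DOTTED_QUAD, candidate) is not None

# Assumption: digit lookarounds so a match cannot start or end inside a number.
BOUNDED = re.compile(r'(?<!\d)' + DOTTED_QUAD + r'(?!\d)')

checks = {ip: is_valid(ip)
          for ip in ('123.456.789.111', '127.0.00.1', '08.8.8.8')}
unbounded = re.findall(DOTTED_QUAD, '256.1.1.1')
bounded = BOUNDED.findall('bad 256.1.1.1, good 10.0.0.1')
print(checks, unbounded, bounded)
```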

This is how I've done it. I think it's quite clean.

import re
import urllib.request

def getIP():
    ip_checker_url = "http://checkip.dyndns.org/"
    address_regexp = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
    # In Python 3, urlopen lives in urllib.request and read() returns bytes,
    # so decode the response before searching it with a str pattern.
    response = urllib.request.urlopen(ip_checker_url).read().decode()
    result = address_regexp.search(response)

    if result:
        return result.group()
    else:
        return None

getIP() returns the IP address as a string, or None if no address is found.

You can substitute another regular expression for address_regexp if you prefer stricter parsing, or change the web service provider.
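
To make that substitution easy, the pattern could be passed in as a parameter and the parsing kept separate from the download. This is a rough sketch rather than the answerer's code: find_first_ip, LOOSE, and STRICT are hypothetical names, and the sample HTML strings stand in for a real response so that the example runs without network access.

```python
import re

# Loose pattern: any four groups of 1-3 digits separated by dots.
LOOSE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

# Stricter alternative (an assumption of this sketch): each octet 0..255,
# no leading zeros, and no digits directly before or after the address.
STRICT = re.compile(
    r'(?<!\d)(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}'
    r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?!\d)'
)

def find_first_ip(html, pattern=LOOSE):
    """Return the first address in html matched by pattern, or None."""
    match = pattern.search(html)
    return match.group() if match else None

sample = "<body>Current IP Address: 165.91.15.131</body>"
bogus = "<body>Current IP Address: 999.1.1.1</body>"
print(find_first_ip(sample), find_first_ip(bogus), find_first_ip(bogus, STRICT))
```

Keeping the parser free of network calls also makes it possible to test it against saved responses.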
