Extract domain using regular expression

Suppose I have these URLs:

http://abdd.eesfea.domainname.com/b/33tA$/0021/file
http://mail.domainname.org/abc/abc/aaa
http://domainname.edu 

I just want to extract "domainname.com", "domainname.org" or "domainname.edu". How can I do this?

I think I need to find the last dot just before "com|org|edu...", then print everything from the dot before it up to the dot after it (if there is one).

I need help with the regular expression. Thanks a lot! I am using Python.

If you would like to go the regex route...

RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:

re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme    = $2
# authority = $4
# path      = $5
# query     = $7
# fragment  = $9
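
As a quick check of the numbered groups listed above, here is a minimal sketch that applies the Appendix B regex to one of the example URLs. The helper name split_uri and the sample query and fragment are made up for illustration; only the regex itself comes from the RFC.

```python
import re

# Appendix B regex from RFC 3986, as quoted above.
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

def split_uri(uri):
    """Return the five generic URI components as a dict (hypothetical helper)."""
    m = re.match(re_3986, uri)
    # Every part of the pattern is optional and there is no end anchor,
    # so re.match always succeeds and m is never None here.
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# The "?x=1#top" suffix is an invented example to exercise groups 7 and 9.
parts = split_uri("http://mail.domainname.org/abc/abc/aaa?x=1#top")
for name, value in parts.items():
    print(name, "=", value)
```

Note that this only splits the URI into components; the domain still has to be picked out of the authority in a second step.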

Here is an enhanced, Python-friendly version which uses named capture groups. It is presented as a function within a working script:

import re

def get_domain(url):
    """Return top two domain levels from URI"""
    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                    # anchor to beginning of string
        (?:  (?P<scheme>    [^:/?#\s]+): )?  # capture optional scheme
        (?://(?P<authority>  [^/?#\s]*)  )?  # capture optional authority
             (?P<path>        [^?#\s]*)      # capture required path
        (?:\?(?P<query>        [^#\s]*)  )?  # capture optional query
        (?:\#(?P<fragment>      [^\s]*)  )?  # capture optional fragment
        $                                    # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)
    re_domain =  re.compile(r"""
        # Pick out top two levels of DNS domain from authority.
        (?P<domain>[^.]+\.[A-Za-z]{2,6})  # $domain: top two domain levels.
        (?::[0-9]*)?                      # Optional port number.
        $                                 # Anchor to end of string.
        """, 
        re.MULTILINE | re.VERBOSE)
    result = ""
    m_uri = re_3986_enhanced.match(url)
    if m_uri and m_uri.group("authority"):
        auth = m_uri.group("authority")
        m_domain = re_domain.search(auth)
        if m_domain and m_domain.group("domain"):
            result = m_domain.group("domain")
    return result

data_list = [
    r"http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    r"http://mail.domainname.org/abc/abc/aaa",
    r"http://domainname.edu",
    r"http://domainname.com:80",
    r"http://domainname.com?query=one",
    r"http://domainname.com#fragment",
    ]
for cnt, data in enumerate(data_list, start=1):
    print("Data[%d] domain = \"%s\"" % (cnt, get_domain(data)))

For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation

In addition to Jase's answer: if you don't want to use urlparse, just split the URLs.

Strip off the protocol (http:// or https://), then split the string at the first occurrence of '/'. For the second URL this leaves you with 'mail.domainname.org'. That can then be split on '.', and you select the last two items from the list with [-2:].

This will always yield domainname.org or the equivalent, provided you strip the protocol correctly and the URLs are valid.
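
The steps above can be sketched roughly as follows. The function name domain_by_split is invented here, and the handling of ':port', '?' and '#' is an addition beyond what the answer describes, so treat it as one possible reading rather than the definitive version.

```python
def domain_by_split(url):
    """Return the last two dot-separated labels of the host (hypothetical helper)."""
    # Drop the scheme ("http://", "https://") if present.
    rest = url.split("://", 1)[-1]
    # Keep everything before the first "/", "?" or "#".
    for sep in ("/", "?", "#"):
        rest = rest.split(sep, 1)[0]
    # Drop an optional ":port" suffix (an assumption; the answer does not mention ports).
    host = rest.split(":", 1)[0]
    # Join the last two labels back together.
    return ".".join(host.split(".")[-2:])

for url in (
    "http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    "http://mail.domainname.org/abc/abc/aaa",
    "http://domainname.edu",
):
    print(domain_by_split(url))
```

This does not handle user info such as 'user@host', and for hosts under a two-part suffix like 'example.co.uk' it would return 'co.uk'.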

I would just use urlparse, but it can be done this way. I don't know about the regex, but this is how I would do it.
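
For comparison, a minimal sketch of the urlparse route, assuming Python 3, where the function lives in urllib.parse (in Python 2 it was the urlparse module). The function name domain_by_urlparse is made up for this example.

```python
from urllib.parse import urlparse

def domain_by_urlparse(url):
    """Return the last two labels of the URL's hostname (hypothetical helper)."""
    # .hostname strips any port and user info and is None when the
    # URL has no "//authority" part, e.g. when the scheme is missing.
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

print(domain_by_urlparse("http://mail.domainname.org/abc/abc/aaa"))
print(domain_by_urlparse("http://domainname.com:80"))
```

As with plain splitting, taking the last two labels is only a heuristic; it says nothing about which suffixes are actually registrable.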

Should you need more flexibility than urlparse provides, here's an example to get you started:

import re
def getDomain(url):
    #requires 'http://' or 'https://'
    #pat = r'(https?):\/\/(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    #'http://' or 'https://' is optional
    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    m = re.match(pat, url)
    if m:
        domain = m.group('domain')
        return domain
    else:
        return False

I used the named group (?P<domain>\w+) to grab the match, which is then retrieved by its name with m.group('domain'). The great thing about learning regular expressions is that once you are comfortable with them, solving even complicated parsing problems becomes relatively simple. This pattern could be made more or less forgiving if necessary. For example, it will return '678' if you pass it 'http://123.45.678.90', and it returns only the name without the TLD ('domainname', not 'domainname.com'), but it should work on most ordinary URLs. Regexr is a great resource for learning and testing regexes.
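
As one illustration of making the pattern less forgiving, here is a hedged sketch, not the answerer's code. It assumes three tweaks: hyphens are allowed in labels, the last label must be alphabetic (so a dotted-quad IP address is rejected), and the name is returned together with its TLD. The names PAT and get_domain_strict are hypothetical.

```python
import re

# Variant of the pattern above with the assumed tweaks described in the text.
PAT = re.compile(
    r"(?:https?://)?"            # optional scheme
    r"(?:[\w-]+\.)*"             # any number of subdomain labels
    r"(?P<domain>[\w-]+)\."      # second-level name
    r"(?P<tld>[A-Za-z]{2,})"     # alphabetic top-level label
    r"(?:[/:?#]|$)"              # host must end here
)

def get_domain_strict(url):
    """Return 'name.tld' or None when the host does not look like a DNS name."""
    m = PAT.match(url)
    if m is None:
        return None
    return m.group("domain") + "." + m.group("tld")

print(get_domain_strict("http://abdd.eesfea.domainname.com/b/33tA$/0021/file"))
print(get_domain_strict("http://123.45.678.90"))
```

Returning None instead of False on failure is a stylistic choice in this sketch; either works as long as callers check for it.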
