
How do you extract a URL from a string using Python?

For example:

string = "This is a link http://www.google.com"

How could I extract 'http://www.google.com'?

(Each link will be in the same format, i.e. starting with 'http://')

There may be a few ways to do this, but the cleanest is to use a regex:

>>> import re
>>> myString = "This is a link http://www.google.com"
>>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
http://www.google.com

If there can be multiple links, you can use something similar to the following:

>>> myString = "These are the links http://www.google.com  and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
>>> print(re.findall(r'(https?://[^\s]+)', myString))
['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
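One caveat with the `\S+` approach: a URL at the end of a sentence drags the trailing punctuation along with it. A small helper that trims it afterwards can fix that (a sketch; the set of punctuation characters stripped here is an assumption, adjust it for your data):

```python
import re

def extract_urls(text):
    """Find all http(s) URLs in text, trimming common trailing punctuation."""
    urls = re.findall(r'https?://\S+', text)
    return [url.rstrip('.,;:!?)') for url in urls]

print(extract_urls("See http://www.google.com, and https://stackoverflow.com."))
# ['http://www.google.com', 'https://stackoverflow.com']
```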

In order to find a web URL in a generic string, you can use a regular expression (regex).

A simple regex for URL matching like the following should fit your case.

    regex = r'('

    # Scheme (HTTP, HTTPS, FTP and SFTP):
    regex += r'(?:(https?|s?ftp):\/\/)?'

    # www:
    regex += r'(?:www\.)?'

    regex += r'('

    # Host and domain (including ccSLD):
    regex += r'(?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)'

    # TLD:
    regex += r'([A-Z]{2,6})'

    # IP Address:
    regex += r'|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'

    regex += r')'

    # Port:
    regex += r'(?::(\d{1,5}))?'

    # Query path:
    regex += r'(?:(\/\S+)*)'

    regex += r')'

If you want to be even more precise, in the TLD section you should ensure that the TLD is a valid one (see the entire list of valid TLDs here: https://data.iana.org/TLD/tlds-alpha-by-domain.txt):

    # TLD:
    regex += r'(com|net|org|eu|...)'

Then, you can simply compile the regex above and use it to find possible matches:

    import re

    string = "This is a link http://www.google.com"

    find_urls_in_string = re.compile(regex, re.IGNORECASE)
    url = find_urls_in_string.search(string)

    if url is not None and url.group(0) is not None:
        print("URL parts: " + str(url.groups()))
        print("URL: " + url.group(0).strip())

Which, in the case of the string "This is a link http://www.google.com", will output:

    URL parts: ('http://www.google.com', 'http', 'google.com', 'com', None, None)
    URL: http://www.google.com

If you change the input to a more complex URL, for example "This is also a URL https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo but this is not anymore", the output will be:

    URL parts: ('https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo', 'https', 'host.domain.com', 'com', '80', '/path/page.php?query=value&a2=v2#foo')
    URL: https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo

NOTE: If you are looking for more URLs in a single string, you can still use the same regex, just with findall() instead of search().
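As a sketch, here is the same idea with the pieces above joined into one pattern and used with findall() (simplified: the IP-address alternative is omitted for brevity):

```python
import re

# Simplified version of the pattern built piece by piece above:
# optional scheme, optional www, dotted host labels, TLD, port, path.
regex = (r'((?:(https?|s?ftp)://)?(?:www\.)?'
         r'((?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+[A-Z]{2,6})'
         r'(?::(\d{1,5}))?((?:/\S+)*))')

find_urls_in_string = re.compile(regex, re.IGNORECASE)
text = "Links: http://www.google.com and https://example.org/path"
# findall() returns one tuple per match; element 0 is the full URL.
print([match[0] for match in find_urls_in_string.findall(text)])
# ['http://www.google.com', 'https://example.org/path']
```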

There is another way to extract URLs from text easily. You can use urlextract to do it for you; just install it via pip:

pip install urlextract

and then you can use it like this:

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
print(urls) # prints: ['stackoverflow.com']

You can find more info on my GitHub page: https://github.com/lipoja/URLExtract

NOTE: It downloads a list of TLDs from iana.org to keep you up to date. But if the program does not have internet access, then it's not for you.

This extracts all URLs with parameters; somehow none of the above examples worked for me.

import re

data = 'https://net2333.us3.list-some.com/subscribe/confirm?u=f3cca8a1ffdee924a6a413ae9&id=6c03fa85f8&e=6bbacccc5b'

WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
print(re.findall(WEB_URL_REGEX, data))
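Independent of which regex extracted it, the standard library's urllib.parse can split such a parameterised URL into its components once you have it:

```python
from urllib.parse import urlparse, parse_qs

url = 'https://net2333.us3.list-some.com/subscribe/confirm?u=f3cca8a1ffdee924a6a413ae9&id=6c03fa85f8&e=6bbacccc5b'

# urlparse() splits the URL into scheme, netloc, path, query, etc.
parts = urlparse(url)
print(parts.netloc)                 # net2333.us3.list-some.com
# parse_qs() turns the query string into a dict of lists.
print(parse_qs(parts.query)['id'])  # ['6c03fa85f8']
```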

You can extract any URL from a string using the following patterns:

1.

>>> import re
>>> string = "This is a link http://www.google.com"
>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?'
>>> re.search(pattern, string).group()
'http://www.google.com'

>>> TWEET = ('New Pybites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')
>>> re.search(pattern, TWEET).group()
'http://pybit.es/requests-cache.html'

>>> tweet = ('Pybites My Reading List | 12 Rules for Life - #books '
             'that expand the mind! '
             'http://pbreadinglist.herokuapp.com/books/'
             'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter'
             ' #psychology #philosophy')
>>> re.findall(pattern, tweet)
['http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter']

To take the above pattern to the next level, we can also detect hashtags alongside URLs in the following way:

2.

>>> pattern = r'[(http://)|\w]*?[\w]*\.[-/\w]*\.\w*[(/{1})]?[#-\./\w]*[(/{1,})]?|#[.\w]*'
>>> re.findall(pattern, tweet)
['#books', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']

The above example for capturing URLs and hashtags can be shortened to:

>>> pattern = r'((?:#|http)\S+)'
>>> re.findall(pattern, tweet)
['#books', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', '#psychology', '#philosophy']

The pattern below matches two alphanumeric sequences separated by "." as a URL:

>>> pattern = r'(?:http://)?\w+\.\S*[^.\s]'

>>> tweet = ('PyBites My Reading List | 12 Rules for Life - #books '
             'that expand the mind! '
             'www.google.com/telephone/wire....  '
             'http://pbreadinglist.herokuapp.com/books/'
             'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter '
             "http://-www.pip.org "
             "google.com "
             "twitter.com "
             "facebook.com"
             ' #psychology #philosophy')
>>> re.findall(pattern, tweet)
['www.google.com/telephone/wire', 'http://pbreadinglist.herokuapp.com/books/TvEqDAAAQBAJ#.XVOriU5z2tA.twitter', 'www.pip.org', 'google.com', 'twitter.com', 'facebook.com']

You can try any complicated URL with patterns 1 and 2. To learn more about the re module in Python, check out REGEXES IN PYTHON by Real Python.

Cheers!

I've used a slight variation of @Abhijit's accepted answer.

This one uses \S instead of [^\s], which is equivalent but more concise. It also doesn't use a named group, because there is just one and we can omit the name for simplicity:

import re

my_string = "This is my tweet check it out http://example.com/blah"
print(re.search(r'(https?://\S+)', my_string).group())
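Note that re.search() returns None when there is no match, so the one-liner above raises AttributeError on link-free input; a guarded sketch:

```python
import re

def first_url(text):
    # Return the first http(s) URL in text, or None when there is none.
    match = re.search(r'https?://\S+', text)
    return match.group() if match else None

print(first_url("no links here"))           # None
print(first_url("see http://example.com"))  # http://example.com
```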

Of course, if there are multiple links to extract, just use .findall():

print(re.findall(r'(https?://\S+)', my_string))
