Python Regular Expression for Extracting URL

I'm working on a regular expression and was wondering how to extract a URL from an HTML page. I want to print out the URL from this line:

Website is: http://www.somesite.com 

Every time that link is found, I want to extract just the URL that comes after **Website is:**. Any help will be appreciated.

Will this suffice, or do you need something more specific?

In [230]: s = 'Website is: http://www.somesite.com '
In [231]: re.findall(r'Website is:\s+(\S+)', s)
Out[231]: ['http://www.somesite.com']

You could match each line against a regular expression with a capturing group, like so:

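import re

for line in page:  # page: any iterable of lines from the HTML source
    m = re.match(r"Website is: (.*)", line)  # re.match needs the string to test
    if m:
        print(m.group(1))  # the text captured by the parentheses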

This both checks whether each line matches the pattern and extracts the link from it.

A few pitfalls:

  1. This assumes that the "Website is" expression is always at the start of the line, because re.match only matches at the beginning of the string. If it's not, you could use re.search .

  2. This assumes there is exactly one space between the colon and the website. If that's not true, you could change the expression to something like Website is:\s+(http.*) .

The specifics will depend on the page you are trying to parse.
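
Both pitfalls can be handled at once. The following is only a minimal sketch: the sample `page` list and the helper name `extract_urls` are made up for illustration, and the capture group uses `[^\s<]+` instead of `.*` so that it stops at whitespace or at the start of an HTML tag, which is an assumption about what the page looks like.

```python
import re

# Hypothetical sample input; in practice these would be the lines of the HTML page.
page = [
    "<p>Intro text</p>",
    "<p>Website is:   http://www.somesite.com </p>",
    "Website is: http://www.example.org",
]

# \s+ tolerates any amount of whitespace after the colon.
# [^\s<]+ captures up to the next whitespace character or '<'.
pattern = re.compile(r"Website is:\s+(http[^\s<]+)")


def extract_urls(lines):
    """Return the URL following 'Website is:' for every line that has one."""
    urls = []
    for line in lines:
        m = pattern.search(line)  # search, not match: the phrase may be mid-line
        if m:
            urls.append(m.group(1))
    return urls


print(extract_urls(page))
```

Compiling the pattern once avoids re-parsing it for every line, although for a single small page the difference is negligible.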

Regex might be overkill for this, since the task is so simple.

def main():
    urls = []
    lines = prepare_file("<yourfile>.html")
    for line in lines:
        # keep any line that looks like it contains a URL
        if "www" in line or "http://" in line:
            urls.append(line)
    return urls


def prepare_file(filename):
    with open(filename) as f:  # the file is closed automatically
        lines = f.readlines()  # splits on new lines
    lines = [line.strip() for line in lines]  # remove surrounding white space
    return [line for line in lines if line != '']  # remove empty elements
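
One thing to keep in mind with this approach is that it collects whole lines, not just the URL. A small sketch on in-memory data shows the effect; the function name `lines_with_urls` and the sample lines are invented for the example.

```python
def lines_with_urls(lines):
    """Return the stripped, non-empty lines that contain 'www' or 'http://'."""
    cleaned = (line.strip() for line in lines)
    return [line for line in cleaned if line and ("www" in line or "http://" in line)]


# Hypothetical sample standing in for the contents of the HTML file.
sample = ["Website is: http://www.somesite.com \n", "\n", "No link here\n"]
result = lines_with_urls(sample)
print(result)  # the "Website is: " prefix is still part of the result
```

If only the URL is wanted, the prefix still has to be removed afterwards, for example by splitting the line on "Website is:".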

URLs are awkward to capture with a regex, according to what I've read.

The following regex pattern will probably work for you:

pat = 'Website is: (%s)' % fireball

where fireball is a pattern for catching URLs that you'll find here:

daringfireball.net/2010/07/improved_regex_for_matching_urls
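
To show how the pieces fit together, here is a rough sketch in which `fireball` is a deliberately simplified stand-in (a scheme followed by non-space characters), not the pattern from the linked page, which is much longer and is not reproduced here.

```python
import re

# Simplified stand-in, NOT the Daring Fireball pattern.
fireball = r'https?://\S+'

# The outer parentheses make the whole URL capture group 1.
pat = 'Website is: (%s)' % fireball

s = 'Website is: http://www.somesite.com '
m = re.search(pat, s)
if m:
    print(m.group(1))
```

Because the outer parentheses open first, group 1 stays the full URL even if the substituted pattern contains capturing groups of its own.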
