Regular expression to extract URL from an HTML link

I'm a newbie in Python. I'm learning regexes, but I need help here.

Here is the HTML source:

<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>

I'm trying to code a tool that only prints out http://ptop.se. Can you help me please?

If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially: in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific HTML-parsing project is just a part of.
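As a rough illustration of the pieces described above, the same pattern can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch: the names HREF_PATTERN, sample, and messy are made up here, and the second string is an invented example of single-quoted and unquoted attributes.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE.
# Whitespace inside a character class is still significant in verbose mode,
# so the space in [^'" >] keeps its meaning.
HREF_PATTERN = re.compile(r"""
    href=          # the literal text href=
    ['"]?          # an optional opening quote, single or double
    ([^'" >]+)     # group 1: everything up to a quote, a space, or >
""", re.VERBOSE)

# The HTML from the question.
sample = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
match = HREF_PATTERN.search(sample)
first_url = match.group(1) if match else None
print(first_url)

# Invented input with a single-quoted and an unquoted attribute.
messy = "<a href='a.html'>A</a> <a href=b.html>B</a>"
all_urls = HREF_PATTERN.findall(messy)
print(all_urls)
```

With a single group in the pattern, findall returns just the captured URLs rather than the whole match.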

It's pretty easy to do:

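# BeautifulSoup 4 (pip install beautifulsoup4); version 3 used "from BeautifulSoup import BeautifulSoup"
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_to_parse, 'html.parser')
# href=True keeps only the <a> tags that actually have an href attribute
for tag in soup.find_all('a', href=True):
    print(tag['href'])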

Once you've installed BeautifulSoup, anyway.

Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. The first is probably more elegant; the second just worked a heck of a lot faster on some unoptimized code I wrote a while back.

This should work, although there might be more elegant ways.

import re
url = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
# Lookbehind and lookahead: match only the text between href=" and the next "
r = re.compile(r'(?<=href=").*?(?=")')
print(r.findall(url))

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

If you just want to grab the URL (i.e. you're not really trying to parse the HTML), this might be more lightweight than an HTML parser.
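For comparison, a much cruder pattern can already pull bare URLs out of text. The sketch below is not Gruber's regex; SIMPLE_URL is a hypothetical, deliberately minimal stand-in that assumes a URL starts with http:// or https:// and ends at whitespace, a quote, or an angle bracket.

```python
import re

# Hypothetical minimal pattern, NOT the one from the linked article:
# a scheme followed by anything that is not whitespace, a quote, or < >.
SIMPLE_URL = re.compile(r"""https?://[^\s"'<>]+""")

text = 'See <a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a> for details.'

# The URL appears twice: once in the attribute, once as the link text.
found = SIMPLE_URL.findall(text)

# dict.fromkeys drops duplicates while keeping first-seen order.
unique = list(dict.fromkeys(found))
print(unique)
```

A pattern this simple will also swallow trailing punctuation such as a full stop directly after a URL, which is one of the cases a more liberal pattern has to handle.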

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
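As one possible illustration of the parser route, here is a minimal sketch using HTMLParser from the standard library (html.parser in Python 3). The LinkCollector class and the sample markup are invented for this example; the commented-out link is there to show a case where a regex and a parser disagree.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Invented sample: one link inside an HTML comment, one real link.
html = (
    '<!-- <a href="http://old.example/">retired link</a> -->'
    '<a target="_blank" href="http://www.ptop.se">http://www.ptop.se</a>'
)

collector = LinkCollector()
collector.feed(html)
collector.close()
parsed_links = collector.links
print(parsed_links)

# The regex from the earlier answer also picks up the commented-out link.
regex_links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
print(regex_links)
```

The parser skips the comment because it understands the document structure, whereas the regex only sees text that looks like href=.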

This works pretty well: the non-capturing group matches href= and the opening quote, and the capturing group returns only the link that follows. Tested on http://pythex.org/

(?:href=['"])([:/.A-Za-z?<_&\s=>0-9;-]+)

Output:

Match 1. /wiki/Main_Page

Match 2. /wiki/Portal:Contents

Match 3. /wiki/Portal:Featured_content

Match 4. /wiki/Portal:Current_events

Match 5. /wiki/Special:Random

Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en

Yes, there are tons of them on regexlib. That only proves that REs should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser, but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.

This regex can help you. You should get the first group via \1 or whatever method your language provides.

href="([^"]*)

Example:

<a href="http://www.amghezi.com">amgheziName</a>

Result:

http://www.amghezi.com
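In Python, for instance, the first group is available through match.group(1), and \1 can be used as a backreference in a re.sub replacement. A small sketch of both, using the example link from this answer (the variable names are arbitrary):

```python
import re

html = '<a href="http://www.amghezi.com">amgheziName</a>'

# Option 1: search, then read capture group 1.
match = re.search(r'href="([^"]*)', html)
url = match.group(1) if match else None
print(url)

# Option 2: replace the whole string with the \1 backreference.
# The leading and trailing .* swallow everything around the URL.
url_via_sub = re.sub(r'.*href="([^"]*).*', r'\1', html)
print(url_via_sub)
```

Note that this pattern only handles double-quoted attributes; a single-quoted or unquoted href would not match.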

You can use this:

<a[^>]+href=["'](.*?)["']

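A quick way to try this pattern from Python is sketched below. The sample markup and the name ANCHOR_HREF are invented for illustration; the stylesheet link is included to show that the leading <a keeps the pattern from matching href attributes on other tags such as <link>.

```python
import re

# The pattern from this answer, compiled once for reuse.
ANCHOR_HREF = re.compile(r'''<a[^>]+href=["'](.*?)["']''')

# Invented sample: a <link> tag plus two anchors with different quote styles.
html = (
    '<link rel="stylesheet" href="style.css">'
    '<a class="ext" href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
    "<a href='/about'>About</a>"
)

anchor_links = ANCHOR_HREF.findall(html)
print(anchor_links)
```

Because the pattern only requires the tag name to begin with "a", it would also match other tags starting with that letter, such as <area href="...">, so it is best treated as a rough filter rather than a strict one.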