Extract part of a regex match

I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?

Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search returns None if it doesn't find a match, so don't call group() on the result directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)
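
To make the difference between the whole match and the captured group concrete, here is a minimal sketch. The sample `html` string and the `whole_match` name are made up for illustration and are not part of the original question.

```python
import re

# Hypothetical sample input, not taken from the original question.
html = "<html><head><TITLE>My page</TITLE></head></html>"

title_search = re.search(r"<title>(.*)</title>", html, re.IGNORECASE)

if title_search:
    whole_match = title_search.group(0)  # entire matched text, tags included
    title = title_search.group(1)        # only what the parentheses captured
else:
    whole_match = title = None

print(whole_match)
print(title)
```

group(0) (or group() with no argument) always refers to the full match, while group(1) refers to the first parenthesized group, which is why the tags no longer need to be stripped afterwards.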

Note that starting in Python 3.8, with the introduction of assignment expressions (PEP 572) (the := operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and reusing it in the condition's body:

# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
  title = match.group(1)
# hello

Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

May I recommend Beautiful Soup? It is a very good library for parsing an entire HTML document.

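from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
titleName = soup.title.string  # text inside <title>; soup.title.name is only the tag name 'title'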

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

I'd think this should suffice:

import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
match = pattern.search(text)
title = match.group(1) if match else None  # None when no <title> is found

... assuming that your text (HTML) is in a variable named "text".

This also assumes that no other HTML tags can be legally embedded inside an HTML TITLE tag, and that there is no way to legally embed any other < character within such a container/block.

However ...

Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra, redundant work when various HTML, SGML and XML parsers are already in the standard libraries.)
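
As an illustration of the standard-library route, here is a minimal sketch built on html.parser.HTMLParser. The names `TitleParser` and `extract_title` are made up for this example, and the sketch only collects the text of the first title element.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> also arrives as "title".
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect the pieces.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts) if self._done else None


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title("<html><head><TITLE>Hello world</TITLE></head></html>"))
print(extract_title("<html><body>no title here</body></html>"))
```

This is more code than a one-line regex, but the parser takes care of tag-name case and of text that spans several lines.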

If you're handling "real world" tag soup HTML (which frequently does not conform to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.

Another option is lxml, which is written for properly structured (standards-conformant) HTML. But it has an option to fall back to using BeautifulSoup as a parser: ElementSoup.

The provided pieces of code do not cope with exceptions. May I suggest:

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns the first captured group if the pattern is found, or an empty string by default if it is not.
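
A minimal sketch of how that one-liner behaves in both cases; the wrapper name `first_group_or_empty` and the sample strings are made up for illustration.

```python
import re


def first_group_or_empty(s):
    match = re.search(r"<title>(.*)</title>", s, re.IGNORECASE)
    # When match is None it has no "groups" attribute, so getattr falls back
    # to the lambda, which returns a one-element list holding "".
    return getattr(match, "groups", lambda: [""])()[0]


found = first_group_or_empty("<title>Hello</title>")
missing = first_group_or_empty("no title here")

print(repr(found))
print(repr(missing))
```

The trick avoids an AttributeError on a failed search, at the cost of no longer distinguishing "no title tag" from "empty title".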

The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: White space inside XML/HTML tags).
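
These three failure modes can be reproduced with a short sketch; the helper name `greedy_title` is made up for illustration, and the pattern is the one from that answer.

```python
import re

GREEDY = r"<title>(.*)</title>"


def greedy_title(html):
    m = re.search(GREEDY, html, re.IGNORECASE)
    return m.group(1) if m else None


# Two titles: the greedy .* runs on to the last closing tag.
two_titles = greedy_title("<title>a</title><title>b</title>")
# Newline inside the title: "." does not match "\n" without re.DOTALL.
multiline = greedy_title("<title>a\nb</title>")
# Whitespace inside the tag: the literal "<title>" no longer matches.
spaced_tag = greedy_title("<title >a</title>")

print(repr(two_titles))
print(repr(multiline))
print(repr(spaced_tag))
```

The first case returns too much text, and the other two return None.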

I therefore propose the following improvement:

import re

def search_title(html):
    m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else None

Test cases:

print(search_title("<title   >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))

Output:

with spaces in tags
with newline in tags
first of two titles
with newline
 in title

Ultimately, I go along with the others in recommending an HTML parser, not least because it also handles non-standard use of HTML tags.

I needed something to match package-0.0.1 (name, version), but wanted to reject an invalid version such as 0.0.010.

See the regex101 example.

import re

RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')

example = 'hello-0.0.1'

if match := RE_IDENTIFIER.search(example):
    name, version = match.groups()
    print(f'Name:     {name}')
    print(f'Version:  {version}')
else:
    raise ValueError(f'Invalid identifier {example}')

Output:

Name:     hello
Version:  0.0.1
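
To check the rejection case as well, here is a small sketch that reuses the same pattern. The helper `parse_identifier` and the extra sample identifiers are made up for illustration.

```python
import re

RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')


def parse_identifier(text):
    """Return (name, version) for a valid identifier, otherwise None."""
    m = RE_IDENTIFIER.search(text)
    return m.groups() if m else None


print(parse_identifier('hello-0.0.1'))     # valid
print(parse_identifier('hello-1.20.300'))  # valid, multi-digit components
print(parse_identifier('hello-0.0.010'))   # leading zero in the last component
print(parse_identifier('hello-01.0.0'))    # leading zero in the first component
```

Each version component is either a lone 0 or a number starting with 1-9, which is what rules out leading zeros.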

Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing, and (?<=<title>).+(?=<\/title>) works great. It only matches what's between the parenthesized lookarounds, so you don't have to do the whole group thing.
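
A minimal sketch of the lookaround approach with Python's re module; the sample `html` string is made up for illustration. The slash does not need escaping in a Python pattern, so it is written plainly here.

```python
import re

# Hypothetical sample input.
html = "<head><title>My page</title></head>"

# The lookbehind and lookahead assert the tags without consuming them,
# so the whole match (group 0) is already just the title text.
m = re.search(r"(?<=<title>).+(?=</title>)", html)
title = m.group() if m else None
print(title)

# Python's re only accepts fixed-width lookbehinds, so a variable-width
# opening tag such as <title\s*> cannot be used inside (?<=...).
try:
    re.compile(r"(?<=<title\s*>).+(?=</title>)")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

One trade-off is that the fixed-width restriction prevents tolerating whitespace inside the opening tag, and the greedy .+ still overshoots when several title elements are on one line.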

