
Python: store many regex matches in tuple?

I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is getting my regex search to find all the possible matches and then store them in a tuple.

Let's say I have a page with the following stored in the variable HTMLtext:

<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>

I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:

pages = ["home", "about", "music", "photos", "stuff", "contact"]

So far, I'm able to use regex to search for one result:

pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]

Running this expression makes pages = ['home'].

How can I get the regex search to continue through the whole text, appending each matched string to this tuple?

(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)

Use the findall function of the re module:

pages = re.findall('<a href="/blog/([^"]*)">',HTMLtext)
print(pages)

Output:

['home', 'about', 'music', 'photos', 'stuff', 'contact']
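Note that re.findall returns a list, not a tuple. If a tuple is really what you want, the list can be converted afterwards. A minimal sketch, assuming the markup from the question is held in HTMLtext (abbreviated here to three links):

```python
import re

# Abbreviated copy of the markup from the question (assumed input).
HTMLtext = """<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
</ul>"""

# findall returns a list holding the captured group of every match.
pages = re.findall(r'<a href="/blog/([^"]*)">', HTMLtext)

# Convert to a tuple only if an immutable sequence is needed.
pages_tuple = tuple(pages)
print(pages_tuple)  # ('home', 'about', 'music')
```

For most purposes the list is just as convenient, so the conversion is optional.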

Your pattern won't work on all inputs. The .* is greedy (technically, it finds a maximal match), so when several links fall within its reach it matches from the first href through the last closing "> it can find. It happens to work on your sample only because each link sits on its own line and . does not match a newline by default. The two simplest ways to fix this are to use either a minimal match or a negated character class.

# minimal match approach
pages = re.findall(r'<a\s+href="/blog/(.+?)">',
                   full_html_text, re.I | re.S)

# negated charclass approach
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
                   full_html_text, re.I)
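To see the difference between the greedy pattern and the two fixes, here is a small sketch. The input line is hypothetical: it puts two links on a single line, which is the situation where the greedy .* overshoots.

```python
import re

# Hypothetical input: two links on one line.
line = '<a href="/blog/home">Home</a> <a href="/blog/about">About</a>'

# Greedy: .* runs to the last "> on the line, swallowing both links.
greedy = re.findall(r'<a href="/blog/(.*)">', line)

# Minimal match: .+? stops at the first "> it can reach.
lazy = re.findall(r'<a href="/blog/(.+?)">', line)

# Negated character class: [^"]+ cannot cross the closing quote.
negated = re.findall(r'<a href="/blog/([^"]+)">', line)

print(greedy)   # ['home">Home</a> <a href="/blog/about']
print(lazy)     # ['home', 'about']
print(negated)  # ['home', 'about']
```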

Obligatory Warning

For simple and reasonably well-constrained text, regexes are just fine; after all, that's why we use regex search-and-replace in our text editors when editing HTML! However, the less you know about the input, the more complicated it gets. For example:

  • another attribute intervening between the <a and the href, like <a title="foo" href="bar">
  • casing issues like <A HREF='foo'>
  • whitespace issues
  • alternate quotes like href='/foo/bar' instead of href="/foo/bar"
  • embedded HTML comments

That's not an exhaustive list of concerns; there are others. So using regexes on HTML is possible, but whether it's expedient depends on how much you know about your input.
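A few of the cases listed above can be reproduced in a short sketch. The markup and the BlogLinkParser class are made up for illustration; html.parser.HTMLParser is the standard-library parser, used here only as one possible alternative, not as the one right answer.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup exercising three of the cases listed above.
tricky = (
    '<a title="foo" href="/blog/home">Home</a>\n'
    "<A HREF='/blog/about'>About</A>\n"
    '<!-- <a href="/blog/hidden">commented out</a> -->\n'
)

# The simple pattern misses the first two links and picks up the commented one.
regex_pages = re.findall(r'<a\s+href="/blog/([^"]+)">', tricky, re.I)
print(regex_pages)  # ['hidden']


class BlogLinkParser(HTMLParser):
    """Collects the part of each href that follows /blog/ (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips the quotes.
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("/blog/"):
            self.pages.append(href[len("/blog/"):])


parser = BlogLinkParser()
parser.feed(tricky)
print(parser.pages)  # ['home', 'about']
```

The parser ignores the commented-out link because comments are routed to a separate handler rather than to handle_starttag.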

However, from the little example you've shown, it looks perfectly OK for your own case. You just have to spiff up your pattern and call the right method.

Use findall instead of search:

>>> pages = re.compile('<a href="/blog/(.*)">').findall(HTMLtext)
>>> pages
['home', 'about', 'music', 'photos', 'stuff', 'contact']

The re.findall() and re.finditer() functions are used to find multiple matches.
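Where findall returns the captured strings directly, finditer yields match objects, which is handy when you also want positions. A minimal sketch, assuming an abbreviated copy of the question's markup:

```python
import re

# Abbreviated, assumed copy of the question's markup.
HTMLtext = """<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>"""

pattern = re.compile(r'<a href="/blog/([^"]*)">')

# finditer yields match objects lazily, so offsets are available too.
for match in pattern.finditer(HTMLtext):
    print(match.group(1), match.start())

# Collecting group(1) from each match gives the same list findall would.
pages = [m.group(1) for m in pattern.finditer(HTMLtext)]
print(pages)  # ['about', 'music']
```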

To find all results, use findall(). You also only need to compile the regex once; after that you can reuse it.

href_re = re.compile('<a href="/blog/(.*)">')  # Compile the regexp once

pages = href_re.findall(HTMLtext)  # Find all matches - ["home", "about",
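As a rough sketch of what reusing the compiled pattern might look like across several documents (the HREF_RE name, the blog_pages helper and the sample documents are all made up for illustration, and the negated character class is swapped in for .* so that links sharing a line are still split correctly):

```python
import re

# Compile once at module level, then reuse for every document.
HREF_RE = re.compile(r'<a href="/blog/([^"]*)">')


def blog_pages(html):
    """Return the /blog/ path segments linked from one HTML string."""
    return HREF_RE.findall(html)


# Hypothetical documents, for illustration only.
documents = [
    '<a href="/blog/home">Home</a>',
    '<a href="/blog/music">Audio</a> <a href="/blog/photos">Gallery</a>',
]
for doc in documents:
    print(blog_pages(doc))
```

The re module also caches recently used patterns internally, so the benefit of compiling up front is mostly readability plus a small saving per call.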
