Using regular expressions to get HTML tags in Python

Question

I am trying to read through an HTML document using Python and gather all of the table rows into a single list. (I am aware of specialized tools for this purpose, but I must use regex.) Here is my code so far:

import urllib
import re
URL = 'http://www.xpn.org/events/concert-calendar'
sock = urllib.urlopen( URL )
doc = sock.read()
sock.close()
patString = r'''
    <tr(.*?)>
    (.*?)
    </tr>
    '''
pattern = re.compile(patString, re.VERBOSE)
concerts = re.findall(pattern, doc)
print (concerts)

However, this only prints an empty list. I have tried a few different patterns, but all have produced the same result. I'm pretty sure that the issue is the pattern, but I'm not entirely sure (I am still getting acquainted with Python while writing this). The table rows I am trying to find have the format <tr class="odd/even"> other data </tr>, and I would like to capture all of this data and place it into a list for use later in the script.

Any help is appreciated. Thanks.

Answer 1

The pattern below matches your sample data just fine. If the data runs over multiple lines, turn on the option that lets . match \n. That option is re.DOTALL, by the way.

<tr(.*?)>(.*?)</tr>
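To make the effect of re.DOTALL concrete, here is a minimal sketch. The `sample_html` string is a hypothetical stand-in for the downloaded page (it is not taken from the real site), with each row spread over several lines:

```python
import re

# Hypothetical sample standing in for the downloaded page;
# each row spans several lines, as real table markup often does.
sample_html = """<table>
<tr class="odd">
  <td>Band A</td>
</tr>
<tr class="even">
  <td>Band B</td>
</tr>
</table>"""

# Without DOTALL, "." does not match newlines, so multi-line rows are missed.
rows_without_dotall = re.findall(r"<tr(.*?)>(.*?)</tr>", sample_html)

# With DOTALL, "." also matches "\n"; each result is an
# (attributes, inner_html) tuple, one per capturing group.
row_pattern = re.compile(r"<tr(.*?)>(.*?)</tr>", re.DOTALL)
rows = row_pattern.findall(sample_html)

print(rows_without_dotall)
print(rows)
```

With this sample, the first call finds nothing, while the DOTALL version returns one tuple per row.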

The ? qualifier on the data in the middle is pretty important; otherwise it could match entire <tr></tr> blocks as the data part.
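The difference can be seen by comparing the lazy and greedy forms on a small input. This is only an illustration; the `sample` string is made up for the purpose:

```python
import re

# Hypothetical one-line sample containing two rows.
sample = '<tr class="odd">first</tr><tr class="even">second</tr>'

# Lazy middle group: stops at the first closing </tr> it can reach.
lazy = re.findall(r"<tr(.*?)>(.*?)</tr>", sample)

# Greedy middle group: runs on to the last </tr>, swallowing the
# second row into the data part of the first match.
greedy = re.findall(r"<tr(.*?)>(.*)</tr>", sample)

print(lazy)
print(greedy)
```

The lazy pattern yields two matches, one per row; the greedy one yields a single match whose data part contains the whole second row.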

It is easy because you are not parsing HTML, but instead just trying to extract some tags in a very specific case.

Things will get ugly if, for example, you have a <tr> nested inside another <tr>.
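A small sketch of that failure mode, using a made-up `nested` string in which one table sits inside a row of another:

```python
import re

# Hypothetical input: a table nested inside a row of an outer table.
nested = "<tr><td><table><tr><td>inner</td></tr></table></td></tr>"

# The lazy data group ends at the first </tr>, which belongs to the
# inner row, so the outer row is cut short and the inner row is never
# reported on its own.
matches = re.findall(r"<tr(.*?)>(.*?)</tr>", nested, re.DOTALL)

print(matches)
```

The regex has no way to pair each opening tag with its own closing tag, which is why this approach only holds up for flat, predictable markup.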

Answer 1 (3, accepted, 2014-05-09 17:31:42)
