Using regex in Python for HTML tags
I am trying to read through an HTML document using Python and gather all of the table rows into a single list. (I am aware of specialized tools for this purpose, but I must use regex.) Here is my code so far:
import urllib
import re
URL = 'http://www.xpn.org/events/concert-calendar'
sock = urllib.urlopen( URL )
doc = sock.read()
sock.close()
patString = r'''
<tr(.*?)>
(.*?)
</tr>
'''
pattern = re.compile(patString, re.VERBOSE)
concerts = re.findall(pattern, doc)
print (concerts)
However, the print is only printing an empty list. I have tried a few different patterns, but all have produced the same result. I'm pretty sure that the issue is the pattern, but I'm not entirely sure (I am still getting familiar with Python while writing this). The table rows I am trying to find have the format
<tr class="odd/even"> other data </tr>
and I would like to capture all of this data and place it into a list for use later in the script.
Any help is appreciated. Thanks
The pattern below matches your sample data just fine. If the data runs over multiple lines, turn on the option that lets the dot (.) match \n; that option is re.DOTALL, by the way.
<tr(.*?)>(.*?)</tr>
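To see what re.DOTALL changes, here is a minimal sketch. The sample markup and the names doc, pattern_text, single_line_only and all_rows are made up for illustration; the real page may be laid out differently.

```python
import re

# Hypothetical sample markup; one row spans several lines, one does not.
doc = '''<table>
<tr class="odd">
  <td>Band A</td>
</tr>
<tr class="even"><td>Band B</td></tr>
</table>'''

pattern_text = r'<tr(.*?)>(.*?)</tr>'

# Without re.DOTALL, "." stops at a newline, so the multi-line row is missed.
single_line_only = re.findall(pattern_text, doc)

# With re.DOTALL, "." also matches "\n", so both rows are found.
all_rows = re.findall(pattern_text, doc, re.DOTALL)

print(len(single_line_only))  # 1
print(len(all_rows))          # 2
```

Each result is a tuple of the two groups: the attribute text of the tag and the row contents. If every row of the real page spans several lines, the version without re.DOTALL would find nothing, which would explain the empty list.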
The ? qualifier on the data group in the middle is pretty important: it makes that group non-greedy. Otherwise it could match entire <tr></tr> blocks as the data part.
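A small comparison shows the difference between the two forms. The input string html and the names lazy and greedy are invented for this sketch.

```python
import re

# Hypothetical input: two rows on a single line.
html = '<tr class="odd">first</tr><tr class="even">second</tr>'

# Non-greedy data group: stops at the first </tr> it can reach.
lazy = re.findall(r'<tr(.*?)>(.*?)</tr>', html)

# Greedy data group: runs on to the last </tr> on the line.
greedy = re.findall(r'<tr(.*?)>(.*)</tr>', html)

print(lazy)    # [(' class="odd"', 'first'), (' class="even"', 'second')]
print(greedy)  # [(' class="odd"', 'first</tr><tr class="even">second')]
```

With the greedy group, the second row is swallowed into the data part of the first one, so only a single match comes back.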
It is easy because you are not parsing HTML, but instead just trying to extract some tags in a very specific case.
Things will get ugly if you have a <tr> inside a <tr>, for example.
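As a rough illustration of the nested case, the sketch below runs the same pattern over a made-up row that contains a table of its own. The string nested and the name rows are hypothetical.

```python
import re

# Hypothetical markup: an outer row whose cell holds another table row.
nested = ('<tr class="outer"><td><table>'
          '<tr class="inner">x</tr>'
          '</table></td></tr>')

rows = re.findall(r'<tr(.*?)>(.*?)</tr>', nested, re.DOTALL)

# The match starts at the outer <tr> but ends at the inner </tr>,
# so the captured data is cut short and the inner row is never
# reported as a row of its own.
print(rows)  # [(' class="outer"', '<td><table><tr class="inner">x')]
```

The regex has no notion of nesting depth, so it pairs the first opening tag with the first closing tag it finds.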