Using regex in Python for HTML tags

I am trying to read through an HTML document using Python and gather all of the table rows into a single list. (I am aware of specialized tools for this purpose, but I must use regex.) Here is my code so far:

import urllib
import re
URL = 'http://www.xpn.org/events/concert-calendar'
sock = urllib.urlopen( URL )
doc = sock.read()
sock.close()
patString = r'''
    < tr(. * ?)>
    (.*?)
    < /tr>
    '''
pattern = re.compile(patString, re.VERBOSE)
concerts = re.findall(pattern, doc)
print (concerts)

However, the print statement only prints an empty list. I have tried a few different patterns, but all of them have produced the same result. I'm pretty sure that the issue is the pattern, but I'm not entirely sure (I am still getting familiar with Python while writing this). The table rows I am trying to find have the format <tr class="odd/even"> other data </tr>, and I would like to capture all of this data and place it into a list for use later in the script.

Any help is appreciated. Thanks.

The pattern below matches your sample data just fine. If the data runs over multiple lines, turn on the option that lets . match \n. That option is re.DOTALL, by the way.

<tr(.*?)>(.*?)</tr>
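A minimal sketch of the difference re.DOTALL makes, assuming a small inline sample instead of the live page (the sample markup and the variable names are made up for illustration, and the real page may look different):

```python
import re

# Hypothetical sample; each row spans several lines, as it might on the real page.
sample_html = """<table>
<tr class="odd">
  <td>Band A</td>
</tr>
<tr class="even">
  <td>Band B</td>
</tr>
</table>"""

# Without re.DOTALL, . stops at a newline, so no row can be matched.
rows_default = re.findall(r"<tr(.*?)>(.*?)</tr>", sample_html)

# With re.DOTALL, . also matches \n, so multi-line rows are captured.
rows_dotall = re.findall(r"<tr(.*?)>(.*?)</tr>", sample_html, re.DOTALL)

print(rows_default)  # []
for attrs, body in rows_dotall:
    print(attrs.strip(), "->", body.strip())
```

Each result is a tuple of the two groups: the tag's attributes and the row's contents.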

The ? qualifier on the data in the middle (it makes .* non-greedy) is pretty important; otherwise it could match entire <tr></tr> blocks as the data part.
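To illustrate, here is a small comparison of the non-greedy and greedy forms on a made-up one-line sample with two rows (the sample string is an assumption, not taken from the page):

```python
import re

# Hypothetical sample with two adjacent rows on one line.
sample_html = '<tr class="odd">first</tr><tr class="even">second</tr>'

# Non-greedy middle group: stops at the first </tr> it can reach.
lazy = re.findall(r"<tr(.*?)>(.*?)</tr>", sample_html)

# Greedy middle group: runs on to the last </tr> in the string.
greedy = re.findall(r"<tr(.*?)>(.*)</tr>", sample_html)

print(lazy)    # [(' class="odd"', 'first'), (' class="even"', 'second')]
print(greedy)  # [(' class="odd"', 'first</tr><tr class="even">second')]
```

With the greedy version, both rows collapse into a single match whose data part swallows the second row's opening tag.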

It is easy because you are not parsing HTML, but instead just trying to extract some tags in a very specific case.

Things will get ugly if you have a <tr> inside a <tr>, for example.
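A quick sketch of what goes wrong, assuming a hypothetical row that contains a nested table (the sample string is invented for illustration):

```python
import re

# Hypothetical sample: an outer row whose cell holds a table with its own row.
nested_html = "<tr><td><table><tr><td>inner</td></tr></table></td></tr>"

matches = re.findall(r"<tr(.*?)>(.*?)</tr>", nested_html, re.DOTALL)

print(matches)  # [('', '<td><table><tr><td>inner</td>')]
```

The match starts at the outer <tr> but ends at the inner </tr>, so the captured data is a truncated fragment and the inner row is never returned on its own. A regex of this form has no way to pair opening and closing tags by nesting depth.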
