Need help with a Python regular expression

Question

I need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.

I can get it when I use

import re

start = '<font color="red">'
end = '</font>'
expression = start + '(.*?)' + end
match = re.compile(expression).search(web_source_code)
needed_info = match.group(1)

but then I have to choose whether to fetch <font color="red"> or <span style="font-weight:bold;">, and the match fails when the site uses the other tag.

How do I modify the regular expression so it always succeeds?

Answer 1

Don't parse HTML with regex.

Regex is not the right tool for this problem. Look up BeautifulSoup or lxml.
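
As a rough illustration of the parser-based approach, here is a minimal sketch. It uses the standard library's html.parser instead of BeautifulSoup or lxml, purely so that it runs without extra installs; the class name RedOrBoldCollector and the sample string are made up for this example, and nested tags of the same kind are not handled.

```python
from html.parser import HTMLParser


class RedOrBoldCollector(HTMLParser):
    """Collect text inside <font color="red"> or <span style="font-weight:bold;">."""

    # (tag, attribute, value) combinations of interest, taken from the question
    WANTED = {("font", "color", "red"), ("span", "style", "font-weight:bold;")}

    def __init__(self):
        super().__init__()
        self.results = []      # finished text fragments
        self._open_tag = None  # name of the wanted tag we are currently inside
        self._buffer = []      # text pieces seen since that tag opened

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        if self._open_tag is None and any((tag, k, v) in self.WANTED for k, v in attrs):
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.results.append("".join(self._buffer))
            self._open_tag = None


sample = (
    '<font color="red">needed-info-here</font> or '
    '<span style="font-weight:bold;">needed-info-here</span> '
    '<font color="white">but not this info</font>'
)
collector = RedOrBoldCollector()
collector.feed(sample)
collector.close()
print(collector.results)
```

BeautifulSoup or lxml would express the same selection in far fewer lines; the point of the sketch is only that a parser looks at tags and attributes rather than raw characters.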

Answer 2

You can join two alternatives with a vertical bar:

# group each set of alternatives so that | does not split the whole expression
start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'

since you know that a font tag will always be closed by </font>, a span tag always by </span>.
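
A runnable sketch of this alternation approach follows. The non-capturing groups (?:...) around each set of alternatives matter: without them, | would split the entire concatenated expression instead of just the tags. The sample string and the names pattern and found are made up for illustration.

```python
import re

# each side is wrapped in (?:...) so the alternation stays local to the tag
start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'
pattern = re.compile(start + '(.*?)' + end)

sample = ('<font color="red">first</font> and '
          '<span style="font-weight:bold;">second</span>')

# with a single capturing group, findall returns the captured text of each match
found = pattern.findall(sample)
print(found)
```

This pattern does not check that the closing tag pairs with the opening one; it relies on the page being well formed.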

However, also consider using a solid HTML parser such as BeautifulSoup rather than rolling your own regular expressions; HTML is in general particularly unsuitable for parsing with regular expressions.

Answer 3

Regular expressions are not your best choice for parsing HTML, but for the sake of education, here is a possible answer to your question:

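# capture the tag name; either attribute form from the question is accepted
start = '<(?P<tag>font|span) (?:color="red"|style="font-weight:bold;")>'
# (?P=tag) requires the closing tag to repeat the captured name
end = '</(?P=tag)>'
# the wanted text ends up in group 2, because group 1 holds the tag name
expression = start + '(.*?)' + end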
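
To show the named backreference at work, here is a small self-contained sketch. The attribute alternation and the extra group name info are assumptions added for this example so that both of the asker's tags match and the text can be fetched by name; the sample strings are invented.

```python
import re

pattern = re.compile(
    r'<(?P<tag>font|span) (?:color="red"|style="font-weight:bold;")>'
    r'(?P<info>.*?)'
    r'</(?P=tag)>'  # must repeat whatever name (?P<tag>...) captured
)

sample = ('<font color="red">first</font> and '
          '<span style="font-weight:bold;">second</span>')
infos = [m.group('info') for m in pattern.finditer(sample)]
print(infos)

# a mismatched closing tag is rejected by the backreference
mismatched = pattern.search('<font color="red">x</span>')
print(mismatched)
```

Fetching the text by name avoids having to count groups when the expression changes.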

Answer 4

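expression = '(<font color="red">(.*?)</font>|<span style="font-weight:bold;">(.*?)</span>)'
match = re.compile(expression).search(web_source_code)
# group 2 is set for a <font> match, group 3 for a <span> match
needed_info = match.group(2) if match.group(2) is not None else match.group(3)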

This would get the job done, but you shouldn't really be using regex to parse HTML.

Answer 5

Regex and HTML are not such a good match; HTML has too many potential variations that will trip up your regex. BeautifulSoup is the standard tool to employ here, but I find pyparsing can be just as effective, and sometimes even simpler to construct when trying to locate a particular tag relative to a particular previous tag.

Here is how to address your question using pyparsing:

html = """ need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
<font color="white">but not this info</font> and 
<span style="font-weight:normal;">dont want this either</span>
"""

from pyparsing import *

font,fontEnd = makeHTMLTags("FONT")
# only match <font> tags with color="red"
font.setParseAction(withAttribute(color="red"))
# only match <span> tags with given style
span,spanEnd = makeHTMLTags("SPAN")
span.setParseAction(withAttribute(style="font-weight:bold;"))

# define full match patterns, define "body" results name for easy access
fontpattern = font + SkipTo(fontEnd)("body") + fontEnd
spanpattern = span + SkipTo(spanEnd)("body") + spanEnd

# now create a single pattern, matching either of the other patterns
searchpattern = fontpattern | spanpattern

# call searchString, and extract body element from each match
for text in searchpattern.searchString(html):
    print(text.body)

Prints:

needed-info-here
needed-info-here

Answer 6

I haven't used Python, but if you set expression to the following, it should work:

/(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close><\/(font|span)>)/gi

Then just access your needed info with the name "info".
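
Since the expression above is written as a /.../gi literal, here is one possible way to adapt it to Python, offered as a sketch rather than the answerer's own code. It assumes re.IGNORECASE stands in for the i flag and finditer for the g flag; the sample string is made up.

```python
import re

pattern = re.compile(
    r'(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close></(font|span)>)',
    re.IGNORECASE,  # stands in for the "i" flag of the original literal
)

sample = ('<FONT color="red">first</FONT> and '
          '<span style="font-weight:bold;">second</span>')

# finditer walks over every match, which is what the "g" flag asks for
infos = [m.group('info') for m in pattern.finditer(sample)]
print(infos)
```

Because [^>]* accepts any attributes, this version matches every font or span tag, including ones the asker does not want, such as a font tag with color="white".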

PS - I also agree about the "not parsing HTML with regex" rule, but if you know that it will appear in either font or span tags, then so be it...

Also, why use the font tag? I haven't used a font tag since I learned CSS.

6 answers

Answer 1
7 2010-08-01 15:28:32

Answer 2
3 (accepted) 2010-08-01 15:31:59

Answer 3
1 2010-08-01 15:31:18

Answer 4
1 2010-08-01 15:31:53

Answer 5
1 2010-08-01 15:47:05

Answer 6
0 2010-08-01 15:31:57
