Python regex help needed

I need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, at random.

I can get it when I use

start = '<font color="red">'
end = '</font>'
expression = start + '(.*?)' + end
match = re.compile(expression).search(web_source_code)
needed_info = match.group(1)

but then I have to pick either <font> or <span>, and the search fails when the site uses the other tag.

How do I modify the regular expression so that it always succeeds?

Don't parse HTML with regex.

Regex is not the right tool for this problem. Look up BeautifulSoup or lxml.

You can join two alternatives with a vertical bar:

# (?:...) keeps each alternation confined to its own part of the expression
start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'

since you know that a font tag will always be closed by </font> and a span tag always by </span>.
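Put together, this might look like the minimal sketch below. The sample input is made up for illustration, and the non-capturing groups are there so that the vertical bars do not split the whole expression once start and end are concatenated with '(.*?)'.

```python
import re

# Made-up sample input, for illustration only.
web_source_code = (
    '<p>a <font color="red">first</font> b '
    '<span style="font-weight:bold;">second</span></p>'
)

# (?:...) groups the alternatives without capturing, so group(1)
# is still the text between the tags.
start = r'(?:<font color="red">|<span style="font-weight:bold;">)'
end = r'(?:</font>|</span>)'
pattern = re.compile(start + r'(.*?)' + end, re.DOTALL)

# With a single capturing group, findall returns the captured strings.
needed_info = pattern.findall(web_source_code)
print(needed_info)  # ['first', 'second']
```

Note that this pattern would also accept a <font> opened and a </span> closed, which is harmless only as long as the site never nests the two.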

However, consider using a solid HTML parser such as BeautifulSoup rather than rolling your own regular expressions; HTML is, in general, particularly unsuitable for parsing with regular expressions.
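For a feel of what the parser route looks like without installing anything, here is a rough sketch that uses the standard library's html.parser in place of BeautifulSoup. The class name NeededInfoParser and the helper extract_needed_info are made up for this example, and the sketch assumes the wanted tags are not nested inside each other.

```python
from html.parser import HTMLParser


class NeededInfoParser(HTMLParser):
    """Collect the text inside <font color="red"> or bold <span> tags."""

    # (tag, attribute, value) combinations taken from the question.
    WANTED = {("font", "color", "red"), ("span", "style", "font-weight:bold;")}

    def __init__(self):
        super().__init__()
        self.results = []
        self._open_tag = None  # name of the wanted tag we are inside, if any
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if self._open_tag is None and any((tag, k, v) in self.WANTED for k, v in attrs):
            self._open_tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.results.append("".join(self._chunks))
            self._open_tag = None


def extract_needed_info(html):
    """Hypothetical helper: return the text of every wanted tag, in order."""
    parser = NeededInfoParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling extract_needed_info(web_source_code) returns a list, so an empty result is easy to detect, unlike a failed regex search that returns None and then raises on .group().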

Regular expressions are not your best choice for parsing HTML, but for the sake of education, here is a possible answer to your question:

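# Variant 1: named group plus backreference, so the closing tag matches the opening one
start = '<(?P<tag>font|span) (?:color="red"|style="font-weight:bold;")>'
end = '</(?P=tag)>'
expression = start + '(.*?)' + end
needed_info = re.compile(expression).search(web_source_code).group(2)  # group 1 is the tag name

# Variant 2: spell out both alternatives; group 2 is set for <font>, group 3 for <span>
expression = '(<font color="red">(.*?)</font>|<span style="font-weight:bold;">(.*?)</span>)'
match = re.compile(expression).search(web_source_code)
needed_info = match.group(2) if match.group(2) is not None else match.group(3)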

This would get the job done, but you really shouldn't be using regex to parse HTML.

Regex and HTML are not a good match: HTML has too many potential variations that will trip up your regex. BeautifulSoup is the standard tool here, but I find pyparsing can be just as effective, and sometimes even simpler to construct when trying to locate a particular tag relative to a particular previous tag.

Here is how to address your question using pyparsing:

html = """ need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
<font color="white">but not this info</font> and 
<span style="font-weight:normal;">dont want this either</span>
"""

from pyparsing import makeHTMLTags, withAttribute, SkipTo

font,fontEnd = makeHTMLTags("FONT")
# only match <font> tags with color="red"
font.setParseAction(withAttribute(color="red"))
# only match <span> tags with given style
span,spanEnd = makeHTMLTags("SPAN")
span.setParseAction(withAttribute(style="font-weight:bold;"))

# define full match patterns, define "body" results name for easy access
fontpattern = font + SkipTo(fontEnd)("body") + fontEnd
spanpattern = span + SkipTo(spanEnd)("body") + spanEnd

# now create a single pattern, matching either of the other patterns
searchpattern = fontpattern | spanpattern

# call searchString, and extract body element from each match
for text in searchpattern.searchString(html):
    print(text.body)

Prints:

needed-info-here
needed-info-here

I haven't used Python, but if you set the expression to the following, it should work:

/(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close><\/(font|span)>)/gi

Then just access your needed info through the group named "info".
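In Python the /.../gi wrapper is not valid syntax, so a rough translation (a sketch, not tested against the real site) would pass re.IGNORECASE for the i flag and use finditer in place of g. The sample string is made up. Keep in mind that this pattern accepts any font or span tag regardless of its attributes, and that [^<]+ stops matching as soon as the info itself contains a nested tag.

```python
import re

# The pattern from above as a Python raw string; "/" needs no escaping here.
pattern = re.compile(
    r'(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close></(font|span)>)',
    re.IGNORECASE,  # replaces the /i flag
)

# Made-up sample input, for illustration only.
sample = '<FONT color="red">one</FONT> and <span style="font-weight:bold;">two</span>'

# finditer replaces the /g flag: it yields every non-overlapping match.
infos = [m.group('info') for m in pattern.finditer(sample)]
print(infos)  # ['one', 'two']
```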

PS - I also agree with the "don't parse HTML with regex" rule, but if you know that the info will always appear in either font or span tags, then so be it...

Also, why use the font tag? I haven't used a font tag since I learned CSS.
