Extracting some HTML tag values in Python

How do I get the text of an <a> element that contains a nested <b> HTML tag in Python using regular expressions?

<a href="/model.xml?hid=90971&amp;modelid=4636873&amp;show-uid=678650012772883921" class="b-offers__name"><b>LG</b> X110</a>

# => LG X110

You don't.

Regular expressions are not well suited to the nested structure of HTML. Use an HTML parser instead.

Don't use regular expressions for parsing HTML. Use an HTML parser like BeautifulSoup. Just look how easy it is:

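from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)
html = r'<a href="removed because it was too long"><b>LG</b> X110</a>'
soup = BeautifulSoup(html, 'html.parser')
# Join every text node in the document, leaving the tags behind
print(''.join(soup.find_all(string=True)))
# LG X110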

Try this:

<a.*<b>(.*)</b>(.*)</a>

Capture groups 1 and 2 ($1 and $2 in many regex flavors) should be what you want; read them with whatever mechanism Python provides for captured groups.
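In Python, captured groups are read from the match object with `group()`. The following is a minimal sketch of applying the pattern above to the markup from the question; the variable names are illustrative, and the pattern remains as fragile as any regex-based approach to HTML.

```python
import re

# Markup from the question, split across lines only for readability.
html = ('<a href="/model.xml?hid=90971&amp;modelid=4636873'
        '&amp;show-uid=678650012772883921" class="b-offers__name">'
        '<b>LG</b> X110</a>')

# Group 1 is the text inside <b>...</b>, group 2 is the text after it.
match = re.search(r'<a.*<b>(.*)</b>(.*)</a>', html)

# re.search returns None when the pattern does not match, so guard for that.
result = match.group(1) + match.group(2) if match else None
print(result)  # LG X110
```

Because `.*` is greedy, this pattern can capture more than intended when a line holds several links or several `<b>` tags.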

Your question was very hard to understand, but from the given output example it looks like you want to strip everything between < and > (the tags themselves) from the input text. That can be done like so:

import re
input_text = '<a bob>i <b>c</b></a>'
# Remove every run that starts with '<' and ends at the next '>'
output_text = re.sub(r'<[^>]*>', '', input_text)
print(output_text)

Which gives you:

i c

If that is not what you want, please clarify.

Please note that the regular expression approach to parsing XML is very brittle. For instance, the above example would break on the input <a name="b>c">hey</a>. (> is a valid character in an attribute value: see the XML specs.)
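To make the failure concrete, here is a small sketch that runs the same substitution on that input. The variable names are illustrative only.

```python
import re

# '>' appears inside a quoted attribute value, which is valid XML.
tricky = '<a name="b>c">hey</a>'

# The pattern stops at the first '>', so it removes only '<a name="b>'
# and leaves the rest of the attribute behind as if it were text.
stripped = re.sub(r'<[^>]*>', '', tricky)
print(stripped)  # c">hey
```

The expected text is just `hey`; the leftover `c">` is the tail of the attribute value that the pattern mistook for content.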

+1 for Jens's answer. lxml is a good library you can use to actually parse this in a robust fashion. If you'd prefer something in the standard library, you can use sax, dom, or ElementTree (the xml.sax, xml.dom, and xml.etree.ElementTree modules).
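As a rough illustration of the standard-library route, the sketch below uses `xml.etree.ElementTree` to collect the text of an element. The helper name `text_of` is made up for this example, and the approach assumes the fragment is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET


def text_of(markup):
    """Return all text inside a well-formed XML fragment, with tags removed."""
    root = ET.fromstring(markup)
    # itertext() yields the text of the element and all its descendants in order.
    return ''.join(root.itertext())


# Markup from the question, split across lines only for readability.
link = ('<a href="/model.xml?hid=90971&amp;modelid=4636873'
        '&amp;show-uid=678650012772883921" class="b-offers__name">'
        '<b>LG</b> X110</a>')

print(text_of(link))                     # LG X110
print(text_of('<a name="b>c">hey</a>'))  # hey
```

A real parser tracks quoting and nesting, so the `>` inside the attribute value causes no trouble. For markup that is not well-formed, `ET.fromstring` raises `ParseError`, and a lenient HTML parser such as lxml or BeautifulSoup is the safer choice.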
