
Match HTML tags in two strings using regex in Python

I want to verify that the HTML tags present in a source string are also present in a target string.

For example:

>>> source = "<em>Hello</em><label>What's your name</label>"
>>> verify_target('<em>Hi</em><label>My name is Jim</label>')
True
>>> verify_target('<label>My name is Jim</label><em>Hi</em>')
True
>>> verify_target('<em>Hi<label>My name is Jim</label></em>')
False

I would drop regex and look at Beautiful Soup.
find_all(True) (spelled findAll(True) in BeautifulSoup 3) lists every tag found in your source.

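from bs4 import BeautifulSoup  # BeautifulSoup 4; the BeautifulSoup 3 import does not work on Python 3
soup = BeautifulSoup(source, "html.parser")
all_tags = soup.find_all(True)  # True matches every tag (findAll in BeautifulSoup 3)
print([tag.name for tag in all_tags])
# ['em', 'label']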

Then you just need to remove possible duplicates and compare the two tag lists.

This snippet verifies that ALL of the source's tags are present in the target's tags. It compares tag names only, so nesting and repeat counts are ignored.

from bs4 import BeautifulSoup

def get_tags_set(source):
    """Return the set of distinct tag names found in an HTML string."""
    soup = BeautifulSoup(source, "html.parser")
    all_tags = soup.find_all(True)  # True matches every tag
    return {tag.name for tag in all_tags}

def verify(tags_source_orig, tags_source_to_verify):
    # True when every tag name of the original also occurs in the other set.
    return tags_source_orig <= tags_source_to_verify

source = "<label>What's your name</label><label>What's your name</label><em>Hello</em>"
source_to_verify = "<em>Hello</em><label>What's your name</label><label>What's your name</label>"
print(verify(get_tags_set(source), get_tags_set(source_to_verify)))  # True

I don't think regex is the right tool here, mainly because HTML is not just a flat string: it has structure, with tags nested inside other tags.
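To illustrate the nesting problem, here is a small sketch. The pattern `TAG_NAME` and the helper `opening_tags` are made up for this example and are not from the question. A regex that pulls out opening tag names returns the same list for the flat target and for the nested one, so it cannot tell the two apart.

```python
import re

# Hypothetical pattern: captures the name of each opening tag.
# Closing tags start with "</", so they are not matched.
TAG_NAME = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def opening_tags(html):
    """Return opening tag names in document order (illustrative helper)."""
    return TAG_NAME.findall(html)


flat = "<em>Hi</em><label>My name is Jim</label>"
nested = "<em>Hi<label>My name is Jim</label></em>"

print(opening_tags(flat))    # ['em', 'label']
print(opening_tags(nested))  # ['em', 'label']
```

Both calls print the same list, although the question expects the nested string to be rejected.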

I suggest you use HTMLParser: create a class that parses the original source and builds a data structure from it. Then verify that the targets to be checked produce the same structure.
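One possible sketch of that idea, using Python 3's standard `html.parser` module. The class `TagPathCollector` and the helper `tag_structure` are names invented for this example, and `verify_target` takes the source as an explicit argument here, unlike in the question. The "structure" is assumed to be a count of ancestor paths, so sibling order does not matter but nesting does. It also assumes every tag is explicitly closed; an unclosed void element such as `<br>` would be treated as still open.

```python
from collections import Counter
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Records the ancestor path of every opening tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._stack = []        # tags that are currently open
        self.paths = Counter()  # path tuple -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        self.paths[tuple(self._stack)] += 1

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass


def tag_structure(html):
    """Return a Counter of tag paths, e.g. {('em',): 1, ('em', 'label'): 1}."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def verify_target(source, target):
    # Counter subtraction keeps only positive counts, so an empty result means
    # every path in the source occurs at least as often in the target.
    return not (tag_structure(source) - tag_structure(target))


source = "<em>Hello</em><label>What's your name</label>"
print(verify_target(source, "<em>Hi</em><label>My name is Jim</label>"))  # True
print(verify_target(source, "<label>My name is Jim</label><em>Hi</em>"))  # True
print(verify_target(source, "<em>Hi<label>My name is Jim</label></em>"))  # False
```

The third call returns False because `label` appears only under the path `('em', 'label')` in the target, while the source has it at the top level.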
