简体   繁体   中英

Match HTML tags in two strings using regex in Python

I want to verify that the HTML tags present in a source string are also present in a target string.

For example:

>> source = '<em>Hello</em><label>What's your name</label>'
>> verify_target(’<em>Hi</em><label>My name is Jim</label>')
True
>> verify_target('<label>My name is Jim</label><em>Hi</em>')
True
>> verify_target('<em>Hi<label>My name is Jim</label></em>')
False

I would get rid of Regex and look at Beautiful Soup .
findAll(True) lists all the tags found in your source.

from BeautifulSoup import BeautifulSoup 
soup = BeautifulSoup(source)
allTags = soup.findAll(True)
[tag.name for tag in allTags ]
[u'em', u'label']

then you just need to remove possible duplicates and confront your tags lists.

This snippet verifies that ALL of source's tags are present in target's tags.

from BeautifulSoup import BeautifulSoup
def get_tags_set(source):
    soup = BeautifulSoup(source)
    all_tags = soup.findAll(True)
    return set([tag.name for tag in all_tags])

def verify(tags_source_orig, tags_source_to_verify):
    return tags_source_orig == set.intersection(tags_source_orig, tags_source_to_verify)

source= '<label>What\'s your name</label><label>What\'s your name</label><em>Hello</em>'
source_to_verify= '<em>Hello</em><label>What\'s your name</label><label>What\'s your name</label>'
print verify(get_tags_set(source),get_tags_set(source_to_verify))

I don't think that regex is the right way here, basically because html is not always just a string, but it's a bit more complex, with nested tags.

I suggest you to use HTMLParser , create a class with parses the original source and builds a structure on it. Then verify that the same data structure is valid for the targets to be verified.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM