Match HTML tags in two strings using regex in Python

Question

I want to verify that the HTML tags present in a source string are also present in a target string.

For example:

>> source = '<em>Hello</em><label>What\'s your name</label>'
>> verify_target('<em>Hi</em><label>My name is Jim</label>')
True
>> verify_target('<label>My name is Jim</label><em>Hi</em>')
True
>> verify_target('<em>Hi<label>My name is Jim</label></em>')
False

Answer 1

I would drop the regex and look at Beautiful Soup.
find_all(True) lists all the tags found in your source.

from bs4 import BeautifulSoup
soup = BeautifulSoup(source, 'html.parser')
all_tags = soup.find_all(True)
[tag.name for tag in all_tags]
['em', 'label']

Then you just need to remove possible duplicates and compare the two tag lists.

This snippet verifies that ALL of source's tags are present in target's tags.

from bs4 import BeautifulSoup

def get_tags_set(source):
    # Parse the markup and collect the distinct tag names it contains.
    soup = BeautifulSoup(source, 'html.parser')
    return {tag.name for tag in soup.find_all(True)}

def verify(tags_source_orig, tags_source_to_verify):
    # True when every tag name in the original also appears in the target.
    return tags_source_orig.issubset(tags_source_to_verify)

source = '<label>What\'s your name</label><label>What\'s your name</label><em>Hello</em>'
source_to_verify = '<em>Hello</em><label>What\'s your name</label><label>What\'s your name</label>'
print(verify(get_tags_set(source), get_tags_set(source_to_verify)))

Answer 2

I don't think regex is the right tool here, because HTML is not just a flat string: its structure is more complex, with nested tags.

I suggest you use HTMLParser: create a class which parses the original source and builds a data structure from it. Then verify that the same data structure is valid for the targets to be verified.
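
A minimal sketch of that idea, assuming Python 3 (where the parser lives in html.parser). The class name TagPathCollector, the helper tag_paths, and the two-argument verify_target are hypothetical names chosen for illustration, and representing the structure as a set of root-to-tag paths is just one possible choice. Void elements such as <br> are not special-cased.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Record each element as the tuple of tag names from the root down to it."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open tags, outermost first
        self.paths = set()  # e.g. {('em',), ('em', 'label')}

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        self.paths.add(tuple(self._stack))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray end tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass


def tag_paths(html):
    # Hypothetical helper: parse the markup and return its set of tag paths.
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def verify_target(source, target):
    # True when every tag path found in source also occurs in target.
    return tag_paths(source) <= tag_paths(target)


source = "<em>Hello</em><label>What's your name</label>"
print(verify_target(source, "<em>Hi</em><label>My name is Jim</label>"))   # True
print(verify_target(source, "<label>My name is Jim</label><em>Hi</em>"))   # True
print(verify_target(source, "<em>Hi<label>My name is Jim</label></em>"))   # False
```

Because each tag is stored together with its ancestors, the third target fails: its label only appears nested inside em, so the top-level label path from the source is missing. Order and duplicates are ignored by the set; a collections.Counter of paths could be compared instead if repeated tags matter.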

2 answers

solution1
4 ACCEPTED 2010-04-20 06:44:35

solution2
1 2010-04-20 06:39:17