How to get all unique HTML tags on a webpage using regular expressions?

I have the HTML source of a web page:

import requests

text = requests.get("https://en.wikipedia.org/wiki/Collatz_conjecture").text

I would like to count the unique HTML tags on this page.

For example: <head>, <title>. Closing tags do not count (<head> and </head> would be counted only once).

Yes, I know this is much easier with an HTML parser such as Beautiful Soup, but I would like to accomplish this using only regular expressions.

I've counted them by brute force, and the answer is around 60 unique tags. How would I go about doing this?

I've already tried using re.findall(), to no avail.

Since the answer is around 60, I would like the output to be:

"Number of unique HTML tags: 60"

The following yields 63 unique URLs from the page in question (note that it counts URLs, not HTML tags):

import requests
import re

url = "https://en.wikipedia.org/wiki/Collatz_conjecture"
text = requests.get(url).text

url_pattern = r"((http(s)?://)([\w-]+\.)+[\w-]+[.com]+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)"

# Get all matching patterns of url_pattern
# this will return a list of tuples 
# where we are only interested in the first item of the tuple
urls = re.findall(url_pattern, text)

# using a set comprehension to take the first item of each tuple
# and filter out duplicates
unique_urls = {x[0] for x in urls}
print(f'Number of unique URLs: {len(unique_urls)} found on {url}')

Output:

Number of unique URLs: 63 found on https://en.wikipedia.org/wiki/Collatz_conjecture
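
The comments about tuples in the snippet above reflect how re.findall behaves when a pattern contains capture groups: each match comes back as a tuple with one entry per group. Here is a minimal offline sketch of that behaviour. The sample string and the simplified pattern are made up for illustration and are not the pattern used above.

```python
import re

# Hypothetical sample text, used only for illustration
sample = "see https://a.example/x and http://b.example/y and https://a.example/x"

# Simplified pattern with two capture groups: the whole URL and the scheme
pattern = r"((https?)://[\w.-]+/\w+)"

# With more than one group, findall returns one tuple per match:
# (whole_url, scheme)
matches = re.findall(pattern, sample)

# Keep only the first group of each tuple; the set drops duplicates
unique_urls = {m[0] for m in matches}

print(matches)
print(sorted(unique_urls))
```

Taking the first item of each tuple works because the outermost group wraps the entire pattern, so it always holds the full match.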

Please do not parse HTML with regular expressions; use a module such as bs4 instead. If you still insist, you can do it as follows:

import requests
import re

url = 'https://en.wikipedia.org/wiki/Collatz_conjecture'
text = requests.get(url).text
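# Grab everything that looks like a tag: '<', any run of non-'>' characters, then '>'
tags = re.findall(r'<[^>]*>', text)

# Keep only the leading '<name' part of each tag (dropping attributes) and close it with '>'
total = []
for tag in tags:
    m = re.match(r'<[^\s>]+', tag)
    if m:  # skip fragments such as '<>' that have no tag name
        total.append(m.group() + '>')

# Closing tags such as '</head>' are not counted
closing = re.compile(r'</[^<]')
unwanted = set(filter(closing.match, total))

# Comments, conditional comments and the doctype are not elements either
unwanted.update(['<!-->', '<!--[if>', '<!DOCTYPE>', '<![endif]-->'])

final = [x for x in set(total) if x not in unwanted]

print(f'Number of unique HTML tags: {len(final)}')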
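
As a more compact variant of the same idea, the tag name can be captured directly so that closing tags, comments and the doctype never match in the first place. This is only a sketch under stated assumptions: the helper name unique_tag_names and the sample document are hypothetical, and the pattern assumes tag names start with a letter. It can still be fooled by a '<' followed by a letter inside scripts or text, so on a real page the count is approximate.

```python
import re

# Matches an opening tag and captures its name: '<' immediately followed by a
# letter, then letters, digits or hyphens. Closing tags ('</p>'), comments
# ('<!-- -->') and the doctype ('<!DOCTYPE>') do not have a letter right after
# '<', so they are skipped without any extra filtering.
TAG_NAME = re.compile(r"<([A-Za-z][A-Za-z0-9-]*)")


def unique_tag_names(html):
    """Return the set of lower-cased tag names found in html (hypothetical helper)."""
    return {name.lower() for name in TAG_NAME.findall(html)}


# Hypothetical sample document, used here instead of a live request
sample = """<!DOCTYPE html>
<html><head><title>Demo</title></head>
<body><!-- a comment --><p class="x">Hi<br/>there</p><P>again</P></body></html>"""

tags = unique_tag_names(sample)
print(f"Number of unique HTML tags: {len(tags)}")  # prints: Number of unique HTML tags: 6
```

To apply it to the page from the question, pass requests.get(url).text to unique_tag_names instead of the sample string. Lower-casing the names means <P> and <p> are counted once.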

