I have the HTML source of a web page:
import requests
text = requests.get("https://en.wikipedia.org/wiki/Collatz_conjecture").text
What I would like to do is to get a count of the number of unique HTML tags on this page.
For example: <head>, <title>. Closing tags do not count (<head> and </head> would be counted only once).
Yes, I know this is much easier with an HTML parser such as Beautiful Soup, but I would like to accomplish this using only regular expressions.
I've counted this by brute force, and the answer is around 60 unique tags. How would I go about doing this?
I've already tried using re.findall(), to no avail.
Since the answer is around 60, I would like the output to be:
"Number of unique HTML tags: 60"
The following yields 63 unique URLs from the page in question (note that it counts URLs, not HTML tags):
import requests
import re
url = "https://en.wikipedia.org/wiki/Collatz_conjecture"
text = requests.get(url).text
url_pattern = r"((http(s)?://)([\w-]+\.)+[\w-]+[.com]+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)"
# Get all matching patterns of url_pattern
# this will return a list of tuples
# where we are only interested in the first item of the tuple
urls = re.findall(url_pattern, text)
# using a set comprehension to take the first item of each tuple
# and filter out duplicates
unique_urls = {x[0] for x in urls}
print(f'Number of unique URLs: {len(unique_urls)} found on {url}')
out:
Number of unique URLs: 63 found on https://en.wikipedia.org/wiki/Collatz_conjecture
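The tuples mentioned in the comments above come from the capture groups in the pattern: when a pattern contains more than one group, re.findall returns one tuple per match instead of a string. Here is a minimal sketch of that behaviour. The sample string and the simplified pattern are made up for illustration and are not the pattern used above.

```python
import re

# Hypothetical sample text, not taken from the Wikipedia page
sample = 'see https://a.example/x and http://b.example/y and https://a.example/x'

# Two capture groups: findall returns a list of 2-tuples
grouped = re.findall(r'((https?)://[\w./-]+)', sample)

# Non-capturing group (?:...): findall returns plain strings
flat = re.findall(r'(?:https?)://[\w./-]+', sample)

# A set drops the duplicate URL
unique_urls = set(flat)
```

With non-capturing groups there is no need to pick the first item out of each tuple afterwards.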
Please do not parse HTML with regular expressions; use a module such as bs4 instead. If you insist, though, it can be done as follows:
import requests
import re
url = 'https://en.wikipedia.org/wiki/Collatz_conjecture'
text = requests.get(url).text
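# Grab everything that looks like a tag: '<', any run of non-'>' characters, '>'
tags = re.findall(r'<[^>]*>', text)
# Keep only the leading '<name' part of each match (dropping attributes) and re-close it
total = [re.match(r'<[^\s>]+', tag).group() + '>' for tag in tags]
# Closing tags such as '</head>' must not be counted
r = re.compile(r'</[^<]')
unwanted = list(filter(r.match, total))
# Comments, conditional comments and the doctype are not tags either
un = ['<!-->', '<!--[if>', '<!DOCTYPE>', '<![endif]-->']
unwanted.extend(un)
final = [x for x in set(total) if x not in set(unwanted)]
print('Number of Unique HTML tags : ', len(final))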
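A shorter variant of the same idea is to capture only the tag name right after '<', so that closing tags, comments and the doctype never match in the first place. This is a rough sketch rather than a definitive solution: the function name unique_tags and the sample HTML are assumptions for illustration, and a regex like this can still be fooled by a '<' followed by a letter inside scripts or comments.

```python
import re

# '<' followed directly by a letter starts an opening tag.
# '</...' and '<!...' cannot match because '/' and '!' are not letters.
TAG_NAME = re.compile(r'<([A-Za-z][A-Za-z0-9-]*)')

def unique_tags(html):
    """Return the set of distinct opening-tag names, lower-cased."""
    return {name.lower() for name in TAG_NAME.findall(html)}

# Hypothetical sample document, used instead of a network request
sample = ('<!DOCTYPE html><html><head><title>t</title></head>'
          '<body><p>a</p><P class="x">b</P><br/><!-- c --></body></html>')

tags = unique_tags(sample)
print(f'Number of unique HTML tags: {len(tags)}')
```

For the page in the question, pass requests.get(url).text to unique_tags instead of the sample string.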