I am a beginner with Regex, so I keep practicing by solving all the exercises I can find. In one of them, I need to extract all the Hex color codes from HTML source code, using Regex and Python. According to the exercise, the rules for spotting a Hex code are:
The sample input is this:
#BED { color: #FfFdF8; background-color:#aef; font-size: 123px; background: -webkit-linear-gradient(top, #f9f9f9, #fff); } #Cab { background-color: #ABC; border: 2px dashed #fff; }
The desired output is:
#FfFdF8 #aef #f9f9f9 #fff #ABC #fff
#BED and #Cab are to be omitted, because they are not Hex colors.
I tried this code, to solve the problem:
import re
text = """
#BED
{
color: #FfFdF8; background-color:#aef;
font-size: 123px;
background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
background-color: #ABC;
border: 2px dashed #fff;
} """
r = re.compile(r'#[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6}')
a = r.findall(text)
print(a)
Obtained output:
['#BED', '#FfF', '#aef', '#f9f', '#fff', '#Cab', '#ABC', '#fff']
It almost works, except that it cuts the 6-digit codes down to 3 digits and it doesn't exclude the two selectors (#BED and #Cab), which are not Hex color codes.
What am I doing wrong? I looked at other attempts, but they didn't provide the correct answer. I am using Python 3.7.4 and the latest version of PyCharm.
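On one hand, you should match the 6-digit codes first; otherwise the 3-digit alternative matches the first half of a 6-digit code and the full code is never matched. Note also that in your pattern the | splits the whole expression, so the # belongs only to the 3-digit alternative; put the two alternatives in a group after the #. But since you also want to match only values inside CSS declarations, and not selectors, look ahead for ;, , or ):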
(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])
https://regex101.com/r/BtZaoV/2
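As a minimal sketch, here is that pattern run against the sample from the question. The names CSS and HEX_COLOR are only illustrative, and re.IGNORECASE is used in place of the inline (?i) flag:

```python
import re

CSS = """
#BED
{
 color: #FfFdF8; background-color:#aef;
 font-size: 123px;
 background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
 background-color: #ABC;
 border: 2px dashed #fff;
}
"""

# Six digits are tried before three, and the lookahead requires
# one of ; , ) immediately after the code (without consuming it).
HEX_COLOR = re.compile(r'#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])', re.IGNORECASE)

colors = HEX_COLOR.findall(CSS)
print(colors)  # ['#FfFdF8', '#aef', '#f9f9f9', '#fff', '#ABC', '#fff']
```

The selectors #BED and #Cab are skipped because each is followed by a newline rather than by one of the three delimiters.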
If you also need to be able to exclude combined selectors, e.g. #BED, foo {, you could look ahead for non-{ characters followed by }:
(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[^{]*})
https://regex101.com/r/BtZaoV/3
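To see the difference between the two lookaheads, here is a small sketch on a made-up input with a combined selector. The string css and the two variable names are assumptions for illustration, not part of the exercise:

```python
import re

# Hypothetical input: "#BED," is a selector that happens to be followed by a comma.
css = "#BED, foo { color: #FfFdF8; } #Cab { border: 2px dashed #fff; }"

# Lookahead for a delimiter right after the code.
after_delimiter = re.compile(r'(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])')
# Lookahead for a closing } with no opening { in between,
# i.e. the code sits inside a declaration block.
inside_block = re.compile(r'(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[^{]*})')

by_delimiter = after_delimiter.findall(css)
by_block = inside_block.findall(css)
print(by_delimiter)  # ['#BED', '#FfFdF8', '#fff']  (selector wrongly included)
print(by_block)      # ['#FfFdF8', '#fff']
```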
Use the case-insensitive flag to keep things DRY. (You could also use a repeated group, (?:[0-9a-f]{3}){1,2}, to keep from repeating the character set, but that'll make the pattern harder to read IMO.)
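For comparison, a short sketch of the alternation form next to the repeated-group form. The sample string and the names verbose and compact are made up for illustration:

```python
import re

# Alternation form: six digits first, then three.
verbose = re.compile(r'#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])', re.IGNORECASE)
# Repeated-group form: the quantifier {1,2} is greedy, so two groups
# of three digits (six digits) are tried before a single group.
compact = re.compile(r'#(?:[0-9a-f]{3}){1,2}(?=[;,)])', re.IGNORECASE)

sample = "color: #FfFdF8; background: linear-gradient(#f9f9f9, #fff);"

verbose_result = verbose.findall(sample)
compact_result = compact.findall(sample)
print(verbose_result)  # ['#FfFdF8', '#f9f9f9', '#fff']
print(compact_result)  # ['#FfFdF8', '#f9f9f9', '#fff']
```

The group is non-capturing on purpose: with a capturing group, re.findall would return the group contents instead of the whole match.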
You can try
#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})(?=;|[^(]*\))
The idea here is to match the 6-character form with higher priority and fall back to the 3-character form if that fails. To ensure it doesn't match #BED or similar selectors, we also need to match whatever terminates a hex color code, so we use a lookahead with alternation: either a ; directly after the code, or a closing ) with no ( before it.
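A quick sketch of this pattern on the one-line sample from the question (the names text and pattern are illustrative):

```python
import re

text = ("#BED { color: #FfFdF8; background-color:#aef; font-size: 123px; "
        "background: -webkit-linear-gradient(top, #f9f9f9, #fff); } "
        "#Cab { background-color: #ABC; border: 2px dashed #fff; }")

# Lookahead: either ";" right after the code, or a ")" that is reached
# without crossing an opening "(" (the code is an argument of a function).
pattern = re.compile(r'#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})(?=;|[^(]*\))')

found = pattern.findall(text)
print(found)  # ['#FfFdF8', '#aef', '#f9f9f9', '#fff', '#ABC', '#fff']
```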
You may use
r = re.compile(r'#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)', re.M)
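With re.M, $ matches at the end of every line, so the negative lookahead (?!$) rejects a code that is the last thing on its line, which is how the selectors #BED and #Cab appear in the sample.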
Sample Python code:
import re
regex = r"#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)"
test_str = ("#BED\n"
"{\n"
" color: #FfFdF8; background-color:#aef;\n"
" font-size: 123px;\n"
" background: -webkit-linear-gradient(top, #f9f9f9, #fff);\n"
"}\n"
"#Cab\n"
"{\n"
" background-color: #ABC;\n"
" border: 2px dashed #fff;\n"
"}")
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)