How can I Correctly Parse a Hex Color Code in Python using Regex?

I am a beginner with regex, so I keep practicing by solving every exercise I can find. In one of them, I need to extract all the hex color codes from HTML source code, using regex and Python. According to the exercise, the rules for spotting a hex code are:

  1. It starts with #
  2. It has 3 or 6 digits
  3. Each digit is in the range of 0-F (the string is case insensitive)

The sample input is this:

 #BED { color: #FfFdF8; background-color:#aef; font-size: 123px; background: -webkit-linear-gradient(top, #f9f9f9, #fff); } #Cab { background-color: #ABC; border: 2px dashed #fff; }

The desired output is:

 #FfFdF8 #aef #f9f9f9 #fff #ABC #fff

#BED and #Cab are to be omitted, because they are not Hex colors.

I tried this code, to solve the problem:

import re

text = """
#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
} """

r = re.compile(r'#[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6}')
a = r.findall(text)
print(a)

Obtained output:

['#BED', '#FfF', '#aef', '#f9f', '#fff', '#Cab', '#ABC', '#fff']

It mostly works, except that it truncates the 6-digit codes to 3 digits and it doesn't eliminate the two selectors (#BED and #Cab) that are not actually hex color codes.

What am I doing wrong? I looked at other attempts, but they didn't provide the correct answer. I am using Python 3.7.4 and the latest version of PyCharm.

First, match the 6-digit codes before the 3-digit codes; otherwise the 3-digit alternative matches the first half of each 6-digit code and the full code is never matched. Put both alternatives in a group so that the leading # applies to each of them. Second, since you want to match only values inside CSS declarations, and not selectors, add a lookahead for one of ";", "," or ")" after the code:

(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])

https://regex101.com/r/BtZaoV/2
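As a quick check, here is a minimal Python sketch that runs both the pattern from the question and the pattern above against the sample stylesheet. The variable names (css, naive, colors) are only illustrative, and re.IGNORECASE is used in place of the inline (?i) flag.

```python
import re

# Sample stylesheet from the question.
css = """
#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
}
"""

# Pattern from the question: the "|" splits the whole pattern, so the second
# branch has no leading "#", and the 3-digit branch is tried first.
naive = re.findall(r"#[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6}", css)
print(naive)
# ['#BED', '#FfF', '#aef', '#f9f', '#fff', '#Cab', '#ABC', '#fff']

# Grouped alternation, 6-digit branch first, followed by a lookahead that
# requires ';', ',' or ')' immediately after the code.
color_re = re.compile(r"#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])", re.IGNORECASE)
colors = color_re.findall(css)
print(colors)
# ['#FfFdF8', '#aef', '#f9f9f9', '#fff', '#ABC', '#fff']
```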

If you also need to exclude combined selectors, e.g. #BED, foo {, you could instead look ahead for any number of non-{ characters followed by a }:

(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[^{]*})

https://regex101.com/r/BtZaoV/3
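To see the difference between the two lookaheads, here is a small sketch on a made-up one-line stylesheet with a combined selector. The input string and the variable names are assumptions for illustration only.

```python
import re

# Hypothetical stylesheet with a combined selector ("#BED, #Cab").
css = "#BED, #Cab { color: #FfFdF8; border: 2px dashed #fff; }"

# Lookahead for a terminator character right after the code.
terminator_re = re.compile(r"(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])")
# Lookahead for a closing brace that is reached before any opening brace,
# i.e. the code sits inside a declaration block.
inside_block_re = re.compile(r"(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[^{]*})")

by_terminator = terminator_re.findall(css)
by_block = inside_block_re.findall(css)

print(by_terminator)  # ['#BED', '#FfFdF8', '#fff']  (the comma lets #BED through)
print(by_block)       # ['#FfFdF8', '#fff']
```

The second pattern assumes the braces in the input are balanced and not nested in unusual ways; it is a heuristic, not a CSS parser.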

Use the case-insensitive flag to keep things DRY. (You could also write (?:[0-9a-f]{3}){1,2} to avoid repeating the character set, but that makes the pattern harder to read, IMO.)
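For comparison, a short sketch of the compact form next to the explicit alternation. The sample string is a shortened, made-up line in the style of the question's input, and the names verbose_re and compact_re are illustrative.

```python
import re

# Explicit alternation: 6 digits first, then 3 digits.
verbose_re = re.compile(r"#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])", re.IGNORECASE)
# Compact form: one or two groups of 3 hex digits. The quantifier is greedy,
# so two groups (6 digits) are tried before one group (3 digits).
compact_re = re.compile(r"#(?:[0-9a-f]{3}){1,2}(?=[;,)])", re.IGNORECASE)

sample = "color: #FfFdF8; background-color:#aef; background: linear-gradient(top, #f9f9f9, #fff);"

verbose_matches = verbose_re.findall(sample)
compact_matches = compact_re.findall(sample)

print(verbose_matches)  # ['#FfFdF8', '#aef', '#f9f9f9', '#fff']
print(compact_matches)  # ['#FfFdF8', '#aef', '#f9f9f9', '#fff']
```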

You can try

#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})(?=;|[^(]*\))

The idea is to try the 6-character match first and fall back to the 3-character match only if that fails. To make sure it doesn't match #BED or similar selectors, we also need to match what terminates a hex color code, so we use a lookahead with alternation: either a ; immediately after the code, or a ) later on with no ( in between.
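A minimal sketch applying this pattern to the one-line sample input from the question (the variable names are illustrative):

```python
import re

# One-line sample input from the question.
css = (
    "#BED { color: #FfFdF8; background-color:#aef; font-size: 123px; "
    "background: -webkit-linear-gradient(top, #f9f9f9, #fff); } "
    "#Cab { background-color: #ABC; border: 2px dashed #fff; }"
)

# Either a ';' follows immediately, or a ')' is reached without crossing a '('.
pattern = re.compile(r"#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})(?=;|[^(]*\))")

found = pattern.findall(css)
print(" ".join(found))  # #FfFdF8 #aef #f9f9f9 #fff #ABC #fff
```

Note that the second lookahead branch scans forward through the rest of the input, so on other stylesheets a selector followed by a stray ")" with no "(" in between would still be matched.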

You may use

r = re.compile(r'#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)', re.M)

Sample Python code:

import re
regex = r"#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)"
test_str = ("#BED\n"
    "{\n"
    "    color: #FfFdF8; background-color:#aef;\n"
    "    font-size: 123px;\n"
    "    background: -webkit-linear-gradient(top, #f9f9f9, #fff);\n"
    "}\n"
    "#Cab\n"
    "{\n"
    "    background-color: #ABC;\n"
    "    border: 2px dashed #fff;\n"
    "}")
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)
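One caveat worth checking: (?!$) only rejects codes that sit at the very end of a line, so it depends on how the stylesheet is laid out. The sketch below uses a made-up input where the selector shares a line with its opening brace and the last declaration has no trailing semicolon.

```python
import re

regex = r"#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)"

# Hypothetical input: "#BED" is not at the end of its line,
# while the color "#abc" is.
tricky = "#BED {\n  color: #abc\n}"

tricky_matches = re.findall(regex, tricky, re.MULTILINE)
print(tricky_matches)  # ['#BED']
```

Here the selector is reported and the real color is missed, so this pattern fits inputs formatted like the sample, with selectors alone on a line and every declaration ending in a semicolon.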
