Matching multiple unicode characters in Golang Regexp

As a simplified example, I want ^⬛+$ matched against ⬛⬛⬛ to return ⬛⬛⬛ as the match.

    r := regexp.MustCompile("^⬛+$")
    matches := r.FindString("⬛️⬛️⬛️")
    fmt.Println(matches)

But the match fails, even though the same approach works with regular ASCII characters.

I'm guessing there's something I don't know about Unicode matching, but I haven't found any decent explanation in documentation yet.

Can someone explain the problem?

Go Play

You need to account for all chars in the string. If you analyze the string you will see it contains:

[Image: code point breakdown of the string, alternating U+2B1B and U+FE0F]
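
One way to do this analysis yourself is to range over the string and print each rune with the %U verb. This is only a minimal sketch: codePoints is a helper name made up for illustration, and the subject string is written with \u escapes (assuming it is three U+2B1B / U+FE0F pairs) so that the invisible character shows up in the source.

```go
package main

import "fmt"

// codePoints is a hypothetical helper: it returns every rune of s
// formatted as U+XXXX, so invisible characters become visible.
func codePoints(s string) []string {
	var out []string
	for _, r := range s { // ranging over a string yields runes, not bytes
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	// Assumed content of the subject string from the question,
	// spelled with escapes instead of literal characters.
	subject := "\u2B1B\uFE0F\u2B1B\uFE0F\u2B1B\uFE0F"
	for _, cp := range codePoints(subject) {
		fmt.Println(cp)
	}
}
```

Each U+2B1B in the output is followed by a U+FE0F, which is the character the original pattern does not account for.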

So you need a regex that matches one or more \x{2B1B}\x{FE0F} pairs, from the start to the end of the string:

^(?:\x{2B1B}\x{FE0F})+$

See the regex demo.

Note that you can also use \p{M}, which matches any Unicode mark character (combining marks, including variation selectors such as \x{FE0F}):

^(?:\x{2B1B}\p{M})+$

See the Go demo:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    r := regexp.MustCompile(`^(?:\x{2B1B}\x{FE0F})+$`)
    matches := r.FindString("⬛️⬛️⬛️")
    fmt.Println(matches)
}
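
The \p{M} variant mentioned above can be tried the same way. The following is a rough sketch rather than a recommendation: matchesSquares and squareWithMark are names invented here, and the test strings use \u escapes so the selector is visible in the source.

```go
package main

import (
	"fmt"
	"regexp"
)

// squareWithMark requires every U+2B1B to be followed by exactly one
// character from the Unicode Mark category (\p{M}).
var squareWithMark = regexp.MustCompile(`^(?:\x{2B1B}\p{M})+$`)

// matchesSquares is a hypothetical helper that reports whether the
// whole string is made of square-plus-mark pairs.
func matchesSquares(s string) bool {
	return squareWithMark.MatchString(s)
}

func main() {
	fmt.Println(matchesSquares("\u2B1B\uFE0F\u2B1B\uFE0F")) // squares followed by VS16
	fmt.Println(matchesSquares("\u2B1B\u2B1B"))             // bare squares, no mark
	fmt.Println(matchesSquares("\u2B1B\u0301"))             // square followed by a combining accent
}
```

Because \p{M} covers every mark character, this pattern is broader than \x{FE0F}: it also accepts a square followed by, say, a combining acute accent, and it rejects bare squares because the mark is not optional. If only the variation selector should be allowed, the explicit \x{FE0F} form is the stricter choice.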

The regular expression in the question matches a string consisting of one or more ⬛ (U+2B1B, black large square).

The subject string is three pairs of black square and variation selector-16 (U+FE0F). The variation selectors are invisible (on my terminal) and prevent a match.

Fix by removing the variation selectors from the subject string or adding the variation selector to the pattern.

Here's the first fix: https://go.dev/play/p/oKIVnkC7TZ1
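
For reference, removing the selectors could look roughly like the sketch below. It is not necessarily what the linked playground does: stripVS16 is a made-up helper name, and it assumes U+FE0F is the only extra character in the subject string.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stripVS16 is a hypothetical helper that deletes every
// VARIATION SELECTOR-16 (U+FE0F) from s.
func stripVS16(s string) string {
	return strings.ReplaceAll(s, "\uFE0F", "")
}

func main() {
	// Same pattern as in the question, written with an escape.
	r := regexp.MustCompile(`^\x{2B1B}+$`)

	// Assumed subject string: three square + selector pairs.
	subject := "\u2B1B\uFE0F\u2B1B\uFE0F\u2B1B\uFE0F"
	cleaned := stripVS16(subject)

	// %+q prints non-ASCII runes as escapes, so the difference is visible.
	fmt.Printf("raw:     %+q\n", r.FindString(subject))
	fmt.Printf("cleaned: %+q\n", r.FindString(cleaned))
}
```

Cleaning the input keeps the pattern simple, but it changes the data being matched. If the original string has to be preserved, adding the selector to the pattern is the less invasive option.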
