
Counting characters in golang string

I am trying to count "characters" in Go. That is, if a string contains one printable "glyph", or "composed character" (or what someone would ordinarily think of as a character), I want it to count as 1. For example, the string "Hello, 世🖖🏿🖖界" should count as 11, since there are 11 characters, and a human would look at this and say there are 11 glyphs.

utf8.RuneCountInString() works well in most cases, including ASCII, accents, Asian characters and even emojis. However, as I understand it, runes correspond to code points, not characters. When I try to use basic emojis it works, but when I use emojis that have different skin tones, I get the wrong count: https://play.golang.org/p/aFIGsB6MsO
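To make the mismatch concrete, here is a minimal sketch using only the standard library. The helper name `describe` is made up for illustration; the test strings are U+1F596 (the hand emoji) on its own and followed by the skin-tone modifier U+1F3FF.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it returns the byte length and the
// rune (code point) count of s.
func describe(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	plain := "\U0001F596"           // hand emoji, one code point
	toned := "\U0001F596\U0001F3FF" // same emoji plus a skin-tone modifier

	for _, s := range []string{plain, toned} {
		b, r := describe(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```

The second string is drawn as a single glyph, yet it consists of two code points, so any rune-based count reports 2 for it.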

From what I read here and here, the following should work, but I still don't seem to be getting the right results (it over-counts):

func CountCharactersInString(str string) int {
    var ia norm.Iter
    ia.InitString(norm.NFC, str)
    nc := 0
    for !ia.Done() {
        nc = nc + 1
        ia.Next()
    }
    return nc
}

This doesn't work either:

func GraphemeCountInString(str string) int {
    re := regexp.MustCompile("\\PM\\pM*|.")
    return len(re.FindAllString(str, -1))
}
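A likely reason the regular expression over-counts: `\pM` only matches code points in the Unicode Mark categories, while the emoji skin-tone modifiers (U+1F3FB to U+1F3FF) are classified as modifier symbols (Sk). The sketch below checks this with the standard `unicode` tables; `isMark` is a name chosen here for illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// isMark reports whether r is in a Unicode Mark category (Mn, Mc, Me),
// which is the class that \pM matches in a Go regular expression.
func isMark(r rune) bool {
	return unicode.Is(unicode.M, r)
}

// isModifierSymbol reports whether r is in the Sk (modifier symbol) category.
func isModifierSymbol(r rune) bool {
	return unicode.Is(unicode.Sk, r)
}

func main() {
	// U+0301 is a combining acute accent, U+1F3FF is a skin-tone modifier.
	for _, r := range []rune{0x0301, 0x1F3FF} {
		fmt.Printf("%U mark=%v modifierSymbol=%v\n", r, isMark(r), isModifierSymbol(r))
	}
}
```

Because the modifier is not a mark, the pattern treats it as the start of a new match instead of attaching it to the emoji before it.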

I am looking for something similar to this in Objective-C:

+ (NSInteger)countCharactersInString:(NSString *) string {
    // --- Calculate the number of characters entered by user and update character count label
    NSInteger count = 0;
    NSUInteger index = 0;
    while (index < string.length) {
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        count++;
        index += range.length;
    }
    return count;
}

Have you tried strings.Count?

package main

import (
    "fmt"
    "strings"
)

func main() {
    fmt.Println(strings.Count("Hello, 世🖖🖖界", "🖖")) // prints 2
}
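Note that strings.Count counts non-overlapping occurrences of a substring rather than characters in general. Its documented special case is the empty substring, for which it returns 1 plus the number of code points. A small sketch follows; `runesViaCount` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// runesViaCount derives the code point count from strings.Count, which
// returns 1 + the number of code points when the substring is empty.
func runesViaCount(s string) int {
	return strings.Count(s, "") - 1
}

func main() {
	s := "Hello, 世界"
	fmt.Println(strings.Count(s, "l")) // occurrences of the substring "l"
	fmt.Println(runesViaCount(s))      // code points, same as utf8.RuneCountInString
}
```

Either way the result is a substring or code point count, so it has the same limitation as the rune-based approaches for emojis with modifiers.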

I wrote a package that allows you to do this: https://github.com/rivo/uniseg. It breaks strings according to the rules specified in Unicode Standard Annex #29, which is what you are looking for. Here is how you would use it in your case:

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    fmt.Println(uniseg.GraphemeClusterCount("Hello, 世🖖🏿🖖界"))
}

This will print 11, as you expect.
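For intuition about what such a segmentation does, here is a deliberately simplified, standard-library-only sketch. It is not an implementation of Annex #29 and not how uniseg works internally: it only folds combining marks and skin-tone modifiers into the preceding code point, and it ignores zero-width-joiner sequences, flag pairs, Hangul jamo and other rules. The names `extends` and `approxClusterCount` are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// extends reports whether r is folded into the preceding cluster under
// this simplified model (assumption: only marks and skin-tone modifiers).
func extends(r rune) bool {
	switch {
	case unicode.Is(unicode.M, r): // combining marks, variation selectors
		return true
	case r >= 0x1F3FB && r <= 0x1F3FF: // emoji skin-tone modifiers
		return true
	}
	return false
}

// approxClusterCount counts code points that start a new cluster.
// The first code point always starts a cluster, even if it is an extender.
func approxClusterCount(s string) int {
	n := 0
	for i, r := range s {
		if i > 0 && extends(r) {
			continue
		}
		n++
	}
	return n
}

func main() {
	s := "Hello, 世\U0001F596\U0001F3FF\U0001F596界"
	fmt.Println(approxClusterCount(s)) // prints 11
}
```

For anything beyond a quick experiment, a library that follows the full rule set is the safer choice.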

The straightforward, native way is to use utf8.RuneCountInString():

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, 世🖖🖖界"
    fmt.Println("counts =", utf8.RuneCountInString(str))
}

See the example in the API documentation: https://golang.org/pkg/unicode/utf8/#example_DecodeLastRuneInString

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, 世🖖界"
    count := 0
    for len(str) > 0 {
        r, size := utf8.DecodeLastRuneInString(str)
        count++
        fmt.Printf("%c %v\n", r, size)

        str = str[:len(str)-size]
    }
    fmt.Println("count:", count)
}

I think the easiest way to do this would be like this:

package main

import "fmt"

func main() {
    str := "Hello, 世🖖🖖界"
    var counter int
    for range str {
        counter++
    }
    fmt.Println(counter)
}

This one prints 11.

To count letter frequencies in a big file:

package main

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "strings"
    "unicode"
)

func countLetters(r io.Reader) (map[string]int, error) {
    // bufio.Reader decodes one whole rune at a time, so a multi-byte
    // character is never cut in half at a buffer boundary, which can
    // happen when converting fixed-size byte chunks to strings.
    br := bufio.NewReader(r)
    out := map[string]int{}
    for {
        c, _, err := br.ReadRune()
        if err == io.EOF {
            return out, nil
        }
        if err != nil {
            return nil, err
        }
        if unicode.IsLetter(c) {
            out[string(c)]++
        }
    }
}

func main() {
    r := strings.NewReader("hello 世界 !")
    counts, err := countLetters(r)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(counts) // map[e:1 h:1 l:2 o:1 世:1 界:1]

}
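One thing to watch when counting over a stream: a fixed-size byte buffer can cut a multi-byte UTF-8 character in half, and decoding each chunk on its own then yields replacement characters instead of the letter. The sketch below illustrates this; `invalidChunks` is a hypothetical helper and assumes `size` is greater than zero.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// invalidChunks splits s into chunks of at most size bytes and returns how
// many of those chunks are not valid UTF-8 on their own.
// Assumption: size > 0.
func invalidChunks(s string, size int) int {
	bad := 0
	for len(s) > 0 {
		n := size
		if n > len(s) {
			n = len(s)
		}
		if !utf8.ValidString(s[:n]) {
			bad++
		}
		s = s[n:]
	}
	return bad
}

func main() {
	// Each of these two characters takes 3 bytes in UTF-8.
	fmt.Println(invalidChunks("世界", 2)) // every 2-byte chunk splits a character
	fmt.Println(invalidChunks("世界", 3)) // chunks line up with character boundaries
}
```

Reading rune by rune, for example through bufio.Reader.ReadRune, avoids the problem regardless of buffer size.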
