
String casting and Unicode in golang

I am reading Go Essentials:

String in Go is an immutable sequence of bytes (8-bit byte values). This is different than languages like Python, C#, Java or Swift where strings are Unicode.

I am playing around with the following code:

s := "日本語"
b := []byte{0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e}
fmt.Println(string(b) == s) // true

for i, runeChar := range b {
    fmt.Printf("byte position %d: %#U\n", i, runeChar)
}

//byte position 0: U+00E6 'æ'
//byte position 1: U+0097
//byte position 2: U+00A5 '¥'
//byte position 3: U+00E6 'æ'
//byte position 4: U+009C
//byte position 5: U+00AC '¬'
//byte position 6: U+00E8 'è'
//byte position 7: U+00AA 'ª'
//byte position 8: U+009E

for i, runeChar := range string(b) {
    fmt.Printf("byte position %d: %#U\n", i, runeChar)
}

//byte position 0: U+65E5 '日'
//byte position 3: U+672C '本'
//byte position 6: U+8A9E '語'

Questions:

  1. Where does Go get the Unicode encoding for a byte array when casting it to a string? How is a rune formed? Does the Go compiler take the encoding from the text file's encoding during compilation?

  2. What are the advantages and disadvantages of implementing String as a byte array, instead of as an array of UTF-16 chars as in Java?

You are quoting from a weak, unreliable source: Go Essentials: Strings. Amongst other things, there is no mention of Unicode code points or UTF-8 encoding.

For example,

package main

import "fmt"

func main() {
    s := "日本語"
    fmt.Printf("Glyph:             %q\n", s)
    fmt.Printf("UTF-8:             [% x]\n", []byte(s))
    fmt.Printf("Unicode codepoint: %U\n", []rune(s))
}

Playground: https://play.golang.org/p/iaYd80Ocitg

Output:

Glyph:             "日本語"
UTF-8:             [e6 97 a5 e6 9c ac e8 aa 9e]
Unicode codepoint: [U+65E5 U+672C U+8A9E]
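To address the first question more directly: the conversion string(b) does not transcode anything; it only copies the bytes. Go source files are defined to be UTF-8, so the literal "日本語" already holds those nine bytes. Runes appear only when something decodes the bytes as UTF-8, which is what ranging over a string does. Here is a minimal sketch of that decoding done by hand. Note that decodeRunes is a hypothetical helper written for illustration, not a standard library function; it relies on utf8.DecodeRuneInString from the unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeRunes is a hypothetical helper for illustration. It walks the bytes
// of s and decodes one UTF-8 sequence at a time, which mirrors what a range
// loop over a string does implicitly.
func decodeRunes(s string) []rune {
	var runes []rune
	for i := 0; i < len(s); {
		// DecodeRuneInString returns the code point and its width in bytes.
		r, size := utf8.DecodeRuneInString(s[i:])
		runes = append(runes, r)
		i += size
	}
	return runes
}

func main() {
	b := []byte{0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e}
	s := string(b) // copies the bytes; no transcoding happens here

	fmt.Println(len(s))                    // length in bytes: 9
	fmt.Println(utf8.RuneCountInString(s)) // length in code points: 3
	fmt.Printf("%U\n", decodeRunes(s))

	// A string may hold bytes that are not valid UTF-8. Decoding then
	// yields U+FFFD (utf8.RuneError) for each undecodable byte.
	bad := string([]byte{0xe6, 0x97})
	fmt.Println(utf8.ValidString(bad))
	fmt.Printf("%U\n", decodeRunes(bad))
}
```

Ranging over []byte, as in the question's first loop, skips this decoding step entirely, which is why each byte there is printed as if it were its own code point.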

References:

The Go Blog: Strings, bytes, runes and characters in Go

The Go Programming Language Specification

Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM

The Unicode Consortium
