
Reading files with a BOM in Go

I need to read Unicode files that may or may not contain a byte-order mark. I could of course check the first few bytes of the file myself, and discard a BOM if I find one. But before I do, is there any standard way of doing this, either in the core libraries or in a third-party package?

No standard way, IIRC (and the standard library would really be the wrong layer to implement such a check in), so here are two examples of how you could deal with it yourself.

One is to use a buffered reader on top of your data stream:

package main

import (
    "bufio"
    "log"
    "os"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer fd.Close()
    br := bufio.NewReader(fd)
    r, _, err := br.ReadRune()
    if err != nil {
        log.Fatal(err) // Note: an empty file ends up here with io.EOF
    }
    if r != '\uFEFF' {
        // Not a BOM -- put the rune back
        if err := br.UnreadRune(); err != nil {
            log.Fatal(err)
        }
    }
    // Now work with br as you would do with fd
    // ...
}

Another approach, which works with objects implementing the io.Seeker interface, is to read the first three bytes and, if they're not a BOM, Seek() back to the beginning, as in:

package main

import (
    "io"
    "log"
    "os"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer fd.Close()
    var bom [3]byte
    _, err = io.ReadFull(fd, bom[:])
    if err != nil {
        log.Fatal(err) // Note: files shorter than three bytes end up here
    }
    if bom[0] != 0xef || bom[1] != 0xbb || bom[2] != 0xbf {
        _, err = fd.Seek(0, io.SeekStart) // Not a BOM -- seek back to the beginning
        if err != nil {
            log.Fatal(err)
        }
    }
    // The next read operation on fd will read real data
    // ...
}

This is possible since instances of *os.File (what os.Open() returns) support seeking and hence implement io.Seeker. Note that that's not the case for, say, the Body reader of HTTP responses, since you can't "rewind" it. bufio.Reader works around this feature of non-seekable streams by performing some buffering (obviously); that's what allows you to call UnreadRune() on it.

Note that both examples assume the file we're dealing with is encoded in UTF-8. If you need to deal with other (or unknown) encodings, things get more complicated.
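To give a rough idea of what "more complicated" means, here is a sketch that recognises the BOMs of the common Unicode encoding forms. detectBOM and the encoding names it returns are made up for illustration. It can only tell you something when a BOM is actually present, and the UTF-32LE check is inherently ambiguous with UTF-16LE text that starts with a NUL character.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectBOM is a hypothetical helper: it reports the encoding implied by a
// leading BOM and the BOM's length in bytes, or ("", 0) if there is none.
func detectBOM(b []byte) (encoding string, size int) {
	// Order matters: the UTF-32LE BOM begins with the UTF-16LE BOM (FF FE),
	// so the longer marks are tested first.
	boms := []struct {
		name string
		mark []byte
	}{
		{"UTF-32BE", []byte{0x00, 0x00, 0xFE, 0xFF}},
		{"UTF-32LE", []byte{0xFF, 0xFE, 0x00, 0x00}},
		{"UTF-8", []byte{0xEF, 0xBB, 0xBF}},
		{"UTF-16BE", []byte{0xFE, 0xFF}},
		{"UTF-16LE", []byte{0xFF, 0xFE}},
	}
	for _, c := range boms {
		if bytes.HasPrefix(b, c.mark) {
			return c.name, len(c.mark)
		}
	}
	return "", 0
}

func main() {
	// FF FE followed by 'h' encoded as UTF-16LE.
	enc, n := detectBOM([]byte{0xFF, 0xFE, 'h', 0x00})
	fmt.Println(enc, n) // prints: UTF-16LE 2
}
```

Detection is only the first half of the job: anything other than UTF-8 still has to be decoded before Go's string functions can treat it as text.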

There's no standard way of doing this in the Go core packages. Follow the Unicode standard.

Unicode Byte Order Mark (BOM) FAQ

You can use the utfbom package. It wraps an io.Reader, and detects and discards the BOM as necessary. It can also return the encoding detected from the BOM.

I thought I would add here a way to strip the Byte Order Mark sequence from a string, rather than messing around with bytes directly (as shown above).

package main

import (
    "fmt"
    "strings"
)

func main() {
    s := "\uFEFF is a string that starts with a Byte Order Mark"
    fmt.Printf("before: '%v' (len=%v)\n", s, len(s))

    // U+FEFF as a string; in UTF-8 it occupies three bytes (EF BB BF)
    byteOrderMarkAsString := string('\uFEFF')

    if strings.HasPrefix(s, byteOrderMarkAsString) {
        fmt.Printf("Found leading Byte Order Mark sequence!\n")
        s = strings.TrimPrefix(s, byteOrderMarkAsString)
    }
    fmt.Printf("after: '%v' (len=%v)\n", s, len(s))
}

Other functions from the "strings" package should work as well.

And this is what prints out:

before: ' is a string that starts with a Byte Order Mark' (len=50)
Found leading Byte Order Mark sequence!
after: ' is a string that starts with a Byte Order Mark' (len=47)
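The same trick carries over to byte slices, since the "bytes" package mirrors most of "strings". The sketch below assumes the whole file has already been read into memory (for example with os.ReadFile). stripBOM and utf8BOM are names invented for this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM is a hypothetical helper: it returns data without a leading
// UTF-8 BOM. Input that has no BOM is returned unchanged.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

func main() {
	// In real use, data would come from something like os.ReadFile.
	data := []byte("\uFEFFhello")
	fmt.Println(len(data), len(stripBOM(data))) // prints: 8 5
}
```

TrimPrefix already does nothing when the prefix is absent, so no separate HasPrefix check is needed here.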

Cheers!
