
Go encoding transform issue

I have the following code in Go:

import (
    "io/ioutil"
    "log"
    "net/http"

    "code.google.com/p/go.text/encoding/charmap"
    "code.google.com/p/go.text/transform"
)

...

res, err := http.Get(url)
if err != nil {
    log.Println("Cannot read", url)
    log.Println(err)
    continue
}
defer res.Body.Close()

The page I load contains non-UTF-8 characters, so I try to use transform:

utfBody := transform.NewReader(res.Body, charmap.Windows1251.NewDecoder())

The problem is that it returns an error even in this simple scenario:

bytes, err := ioutil.ReadAll(utfBody)
log.Println(err)
if err == nil {
    log.Println(bytes)
}

transform: short destination buffer

It does fill bytes with some data, but in my real code I use goquery:

doc, err := goquery.NewDocumentFromReader(utfBody)

goquery sees the error and fails, returning no data.

I tried passing "chunks" of res.Body to transform.NewReader and figured out that it works as long as res.Body contains only valid UTF-8 data. As soon as it contains a non-UTF-8 byte, it fails with the error above.

I'm quite new to Go and don't really understand what's going on or how to deal with this.

Without the whole code along with an example URL it's hard to tell what exactly is going wrong here.

That said, I can recommend the golang.org/x/net/html/charset package for this, as it supports both guessing the charset and converting to UTF-8.

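func fetchUtf8Bytes(url string) ([]byte, error) {
    res, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer res.Body.Close() // always release the connection

    // Optional: passing the Content-Type header gives better charset guessing.
    contentType := res.Header.Get("Content-Type")
    utf8reader, err := charset.NewReader(res.Body, contentType)
    if err != nil {
        return nil, err
    }

    // utf8reader yields the body already converted to UTF-8.
    return ioutil.ReadAll(utf8reader)
}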

Complete example: http://play.golang.org/p/olcBM9ughv
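As a rough illustration of why the Content-Type header helps with guessing, the sketch below uses only the standard library to pull the charset parameter out of a header value. The helper charsetFromContentType is hypothetical and is not part of the charset package. As far as I know, charset.NewReader does something similar with the declared content type and otherwise inspects the start of the body, so treat this as an approximation of the idea rather than the library's exact logic.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper: it returns the lowercased
// charset parameter of a Content-Type header value, or "" if the header is
// missing, malformed, or carries no charset parameter.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	// A server that declares its encoding explicitly.
	fmt.Printf("%q\n", charsetFromContentType("text/html; charset=windows-1251"))

	// No charset declared: the encoding would have to be guessed from the body.
	fmt.Printf("%q\n", charsetFromContentType("text/html"))
}
```

When the header carries no charset, the only remaining hints are in the document itself, such as a byte order mark or a meta tag.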
