I have a following code in go:
import (
"log"
"net/http"
"code.google.com/p/go.text/transform"
"code.google.com/p/go.text/encoding/charmap"
)
...
res, err := http.Get(url)
if err != nil {
log.Println("Cannot read", url);
log.Println(err);
continue
}
defer res.Body.Close()
The page I load contain non UTF-8 symbols. So I try to use transform
utfBody := transform.NewReader(res.Body, charmap.Windows1251.NewDecoder())
But the problem is, that it returns error even in this simple scenarion:
bytes, err := ioutil.ReadAll(utfBody)
log.Println(err)
if err == nil {
log.Println(bytes)
}
transform: short destination buffer
It also actually sets bytes
with some data, but in my real code I use goquery
:
doc, err := goquery.NewDocumentFromReader(utfBody)
Which sees an error and fails with not data in return
I tried to pass "chunks" of res.Body
to transform.NewReader
and figuried out, that as long as res.Body contains no non-UTF8 data it works well. And when it contains non-UTF8 byte it fails with an error above.
I'm quite new to go and don't really understand what's going on and how to deal with this
Without the whole code along with an example URL it's hard to tell what exactly is going wrong here.
That said, I can recommend the golang.org/x/net/html/charset
package for this as it supports both char guessing and converting to UTF 8.
func fetchUtf8Bytes(url string) ([]byte, error) {
res, err := http.Get(url)
if err != nil {
return nil, err
}
contentType := res.Header.Get("Content-Type") // Optional, better guessing
utf8reader, err := charset.NewReader(res.Body, contentType)
if err != nil {
return nil, err
}
return ioutil.ReadAll(utf8reader)
}
Complete example: http://play.golang.org/p/olcBM9ughv
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.