How does GHC/Haskell decide which character encoding to decode from or encode to?

Question

GHC seems to be inconsistent, at the very least, about which character encoding it decodes from.

Consider a file, omatase-shimashita.txt, with the following content, encoded in UTF-8: お待たせしました

readFile seems to read this in properly...

Prelude> content <- readFile "/home/chris/Desktop/omatase-shimashita.txt"
Prelude> length content
8
Prelude> putStrLn content
お待たせしました

However, if I write a simple "echo" server, it does not decode with a default of UTF-8. Consider the following code that handles an incoming client:

handleClient handle = do
  line <- hGetLine handle
  putStrLn $ "Read following line: " ++ toString line
  handleClient handle

And the relevant client code, explicitly sending UTF-8:

Data.ByteString.hPutStrLn handle $ Codec.Binary.UTF8.Generic.fromString "お待たせしました"

Is this not inconsistent behavior? Is there any method to this madness? I am planning to rewrite my application(s) to use ByteString objects explicitly and to encode and decode explicitly using Codec.Binary.UTF8, but it would be good to know what's going on here anyway... :o/

UPDATE: I am running on Ubuntu Linux, version 10.10, with a locale of en_US.UTF-8...

$ cat /etc/default/locale 
LANG="en_US.UTF-8"
$ echo $LANG 
en_US.UTF-8

Answer 1

Which version of GHC are you using? Older versions in particular didn't handle Unicode I/O very well.

This section in the GHC documentation describes how to change input/output encodings:

http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO.html#23

Also, the documentation says this:

A text-mode Handle has an associated TextEncoding, which is used to decode bytes into Unicode characters when reading, and encode Unicode characters into bytes when writing.

The default TextEncoding is the same as the default encoding on your system, which is also available as localeEncoding. (GHC note: on Windows, we currently do not support double-byte encodings; if the console's code page is unsupported, then localeEncoding will be latin1.)

Encoding and decoding errors are always detected and reported, except during lazy I/O (hGetContents, getContents, and readFile), where a decoding error merely results in termination of the character stream, as with other I/O errors.

Maybe this has something to do with your problem? If GHC has defaulted to something other than UTF-8 somewhere, or your handle has been manually set to use a different encoding, that might explain it. If you're just trying to echo text at the console, then some kind of console code-page funniness is probably going on. I know I've had similar problems in the past with other languages, such as Python, when printing Unicode in a Windows console.

Try running hSetEncoding handle utf8 and see if it fixes your problem.
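For what it's worth, here is a minimal sketch of that suggestion using only System.IO from base. The helper names writeUtf8 and readUtf8 and the scratch file name are made up for illustration; the only thing taken from the docs above is hSetEncoding with utf8. The same call should apply to a socket-backed Handle, though that is not exercised here.

```haskell
import System.IO

-- Hypothetical helper: write a String as UTF-8 regardless of the locale.
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h s

-- Hypothetical helper: read a whole file as UTF-8.
-- hGetContents is lazy, so the result is forced before the handle closes.
readUtf8 :: FilePath -> IO String
readUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  s <- hGetContents h
  length s `seq` return s

main :: IO ()
main = do
  let path = "utf8-demo.txt"  -- hypothetical scratch file
  writeUtf8 path "お待たせしました"
  s <- readUtf8 path
  -- Only the length is printed, so stdout's own encoding does not matter.
  print (length s)  -- 8 characters, the same count as in the question
```

Setting the encoding per handle keeps the program independent of whatever LANG happens to be when it runs.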

Answer 2

Your first example uses the standard IO library, System.IO. Operations in this library use the default system encoding (also known as localeEncoding) unless you specify otherwise. Presumably your system is set up to use UTF-8, so that is the encoding used by putStrLn, hGetContents and so on.
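If you want to check what encoding is actually in effect, something along these lines should work. It is a rough sketch using only System.IO; describeEncoding is a made-up helper name, and the output depends on your environment.

```haskell
import System.IO

-- Hypothetical helper: name the encoding attached to a handle.
-- hGetEncoding returns Nothing when the handle is in binary mode.
describeEncoding :: Handle -> IO String
describeEncoding h = do
  enc <- hGetEncoding h
  return (maybe "binary (no encoding)" show enc)

main :: IO ()
main = do
  -- The encoding GHC derived from the environment (LANG and friends).
  putStrLn ("locale encoding: " ++ show localeEncoding)
  out <- describeEncoding stdout
  putStrLn ("stdout encoding: " ++ out)
```

Running it once normally and once with a different LANG is a quick way to see how much the default depends on the environment.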

Your second example uses Data.ByteString. Since this library deals in sequences of bytes only, it does no encoding or decoding. So Data.ByteString.hGetLine converts the bytes in the file directly to a ByteString.
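To make the bytes-versus-characters point concrete, here is a small sketch. It assumes the bytestring package is available (it ships with GHC), and the scratch file name is made up.

```haskell
import qualified Data.ByteString as B
import System.IO

main :: IO ()
main = do
  let path = "bytes-demo.txt"  -- hypothetical scratch file
  -- Write the eight-character string from the question as UTF-8.
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h "お待たせしました"
  -- Data.ByteString reads the raw bytes; no decoding takes place.
  raw <- B.readFile path
  print (B.length raw)           -- 24: three bytes per character here
  print (take 3 (B.unpack raw))  -- [227,129,138], the UTF-8 bytes of お
```

The ByteString has length 24 while the String had length 8, which is exactly the gap that an explicit decoding step has to bridge.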

The best way to do text I/O in general is to use the text package.
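As a rough illustration of that approach, the sketch below decodes a ByteString explicitly with Data.Text.Encoding. It assumes the text and bytestring packages are installed; decodeLine is a made-up helper name, not something from the question.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: decode raw bytes as UTF-8 and report a failure
-- as a value instead of throwing an exception.
decodeLine :: B.ByteString -> Either String T.Text
decodeLine bs = case TE.decodeUtf8' bs of
  Left err  -> Left (show err)
  Right txt -> Right txt

main :: IO ()
main = do
  let bytes = TE.encodeUtf8 (T.pack "お待たせしました")
  print (B.length bytes)                    -- number of bytes
  print (fmap T.length (decodeLine bytes))  -- number of characters, if valid
  -- A truncated multi-byte sequence is rejected rather than garbled.
  print (decodeLine (B.pack [0xE3, 0x81]))
```

In a server like the one in the question, the bytes would come from Data.ByteString.hGetLine and then go through a decoder like this, so the encoding is chosen in the code rather than by the locale.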


Answer 1
6, accepted, 2011-03-13 10:32:55

Answer 2
6, 2011-03-14 09:28:59
