

How does GHC/Haskell decide what character encoding it's going to decode/encode from/to?

It seems that GHC is, at the very least, inconsistent about which character encoding it decides to decode from.

Consider a file, omatase-shimashita.txt, with the following content, encoded in UTF-8: お待たせしました

readFile seems to read this in properly:

Prelude> content <- readFile "/home/chris/Desktop/omatase-shimashita.txt"
Prelude> length content
8
Prelude> putStrLn content
お待たせしました

However, if I write a simple "echo" server, it does not decode with a default of UTF-8. Consider the following code that handles an incoming client:

handleClient handle = do
  line <- hGetLine handle
  putStrLn $ "Read following line: " ++ toString line
  handleClient handle

And the relevant client code, explicitly sending UTF-8:

Data.ByteString.hPutStrLn handle $ Codec.Binary.UTF8.Generic.fromString "お待たせしました"

Is this not inconsistent behavior? Is there any method to this madness? I am planning to rewrite my application(s) to use ByteString objects explicitly and to encode and decode explicitly using Codec.Binary.UTF8, but it would be good to know what's going on here anyway... :o/

UPDATE: I am running Ubuntu Linux 10.10 with a locale of en_US.UTF-8:

$ cat /etc/default/locale 
LANG="en_US.UTF-8"
$ echo $LANG 
en_US.UTF-8

Which version of GHC are you using? Older versions in particular did not handle Unicode I/O very well.

This section in the GHC documentation describes how to change input/output encodings:

http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO.html#23

Also, the documentation says this:

A text-mode Handle has an associated TextEncoding, which is used to decode bytes into Unicode characters when reading, and encode Unicode characters into bytes when writing.

The default TextEncoding is the same as the default encoding on your system, which is also available as localeEncoding. (GHC note: on Windows, we currently do not support double-byte encodings; if the console's code page is unsupported, then localeEncoding will be latin1.)

Encoding and decoding errors are always detected and reported, except during lazy I/O (hGetContents, getContents, and readFile), where a decoding error merely results in termination of the character stream, as with other I/O errors.

Maybe this has something to do with your problem? If GHC has defaulted to something other than UTF-8 somewhere, or your handle has been manually set to use a different encoding, that might explain it. If you're just trying to echo text at the console, then some kind of console code-page oddity is probably going on. I know I've had similar problems in the past with other languages, such as Python, when printing Unicode in a Windows console.

Try running hSetEncoding handle utf8 and see if it fixes your problem.
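To see what hSetEncoding changes, here is a minimal sketch that uses only System.IO from base. The helper names writeUtf8 and readWith and the file name are made up for illustration; they are not part of the original question. The same UTF-8 bytes are read back twice, once decoded as UTF-8 and once as latin1, which is roughly what happens when a handle is not using the encoding you expect.

```haskell
import System.IO

-- Write a string to a file as UTF-8, regardless of the current locale.
-- (writeUtf8 is a hypothetical helper, not a library function.)
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h s

-- Read a whole file, decoding its bytes with the given encoding.
-- (readWith is a hypothetical helper, not a library function.)
readWith :: TextEncoding -> FilePath -> IO String
readWith enc path = withFile path ReadMode $ \h -> do
  hSetEncoding h enc
  s <- hGetContents h
  -- Force the lazy read to finish before withFile closes the handle.
  length s `seq` return s

main :: IO ()
main = do
  -- Make sure the terminal output itself is encoded as UTF-8.
  hSetEncoding stdout utf8
  let path = "omatase-shimashita.txt"
  writeUtf8 path "お待たせしました"
  good <- readWith utf8 path
  bad  <- readWith latin1 path
  print (length good)  -- 8: one Char per Japanese character
  print (length bad)   -- 24: one Char per byte, three bytes per character
  putStrLn good
```

In the echo server, the equivalent step would be calling hSetEncoding on the client handle before the first hGetLine, assuming the handle is read through System.IO rather than Data.ByteString.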

Your first example uses the standard I/O library, System.IO. Operations in this library use the default system encoding (also known as localeEncoding) unless you specify otherwise. Presumably your system is set up to use UTF-8, so that is the encoding used by putStrLn, hGetContents and so on.
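One way to check that presumption is to ask the runtime which encodings it has picked. This is a small sketch using localeEncoding and hGetEncoding from System.IO; describeHandle is a hypothetical helper name, and the names printed depend on your locale.

```haskell
import System.IO

-- Describe the encoding currently attached to a handle.
-- hGetEncoding returns Nothing for a handle in binary mode.
-- (describeHandle is a hypothetical helper, not a library function.)
describeHandle :: String -> Handle -> IO String
describeHandle name h = do
  enc <- hGetEncoding h
  return (name ++ ": " ++ maybe "binary (no encoding)" show enc)

main :: IO ()
main = do
  -- The encoding System.IO derives from the environment (LANG, LC_ALL, ...).
  putStrLn ("localeEncoding: " ++ show localeEncoding)
  mapM_ (\(n, h) -> describeHandle n h >>= putStrLn)
        [("stdin", stdin), ("stdout", stdout), ("stderr", stderr)]
```

Running it with a different LANG value should show how the defaults follow the environment rather than anything stored in the file being read.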

Your second example uses Data.ByteString. Since this library deals in sequences of bytes only, it does no encoding or decoding. So Data.ByteString.hGetLine reads the bytes from the handle directly into a ByteString.

The best way to do text I/O in general is to use the text package.
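As a rough illustration of that approach, the sketch below keeps the raw bytes in a ByteString and decodes them explicitly with Data.Text.Encoding.decodeUtf8' from the text package. The helper name decodeLine is made up for this example, and the bytes are built in memory rather than read from a socket.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO
import System.IO (hSetEncoding, stdout, utf8)

-- Decode raw bytes as UTF-8, returning an error message instead of
-- throwing when the bytes are not valid UTF-8.
-- (decodeLine is a hypothetical helper, not a library function.)
decodeLine :: B.ByteString -> Either String T.Text
decodeLine bytes = case TE.decodeUtf8' bytes of
  Left err -> Left (show err)
  Right t  -> Right t

main :: IO ()
main = do
  hSetEncoding stdout utf8
  -- Stand-in for a line received from the client as raw bytes.
  let bytes = TE.encodeUtf8 (T.pack "お待たせしました")
  print (B.length bytes)  -- 24 bytes on the wire
  case decodeLine bytes of
    Right t -> do
      print (T.length t)  -- 8 characters after decoding
      TIO.putStrLn t
    Left e -> putStrLn ("not valid UTF-8: " ++ e)
```

Decoding at the boundary like this makes the encoding an explicit choice in the program instead of something inherited from the locale.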
