How does GHC/Haskell decide what character encoding it's going to decode/encode from/to?
It seems that GHC is, at the very least, inconsistent in the character encoding it decides to decode from.
Consider a file, omatase-shimashita.txt, with the following content, encoded in UTF-8: お待たせしました
readFile seems to read this in properly...
Prelude> content <- readFile "/home/chris/Desktop/omatase-shimashita.txt"
Prelude> length content
8
Prelude> putStrLn content
お待たせしました
However, if I write a simple "echo" server, it does not decode with a default of UTF-8. Consider the following code that handles an incoming client:
handleClient handle = do
    line <- hGetLine handle
    putStrLn $ "Read following line: " ++ toString line
    handleClient handle
And the relevant client code, explicitly sending UTF-8:
Data.ByteString.hPutStrLn handle $ Codec.Binary.UTF8.Generic.fromString "お待たせしました"
Is this not inconsistent behavior? Is there any method to this madness? I am planning to rewrite my application(s) to explicitly use ByteString objects and explicitly encode and decode using Codec.Binary.UTF8, but it would be good to know what's going on here anyway... :o/
UPDATE: I am running on Ubuntu Linux, version 10.10, with a locale of en_US.UTF-8...
$ cat /etc/default/locale
LANG="en_US.UTF-8"
$ echo $LANG
en_US.UTF-8
Which version of GHC are you using? Older versions especially didn't do Unicode I/O very well.
This section in the GHC documentation describes how to change input/output encodings:
http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO.html#23
Also, the documentation says this:
A text-mode Handle has an associated TextEncoding, which is used to decode bytes into Unicode characters when reading, and encode Unicode characters into bytes when writing.
The default TextEncoding is the same as the default encoding on your system, which is also available as localeEncoding. (GHC note: on Windows, we currently do not support double-byte encodings; if the console's code page is unsupported, then localeEncoding will be latin1.)
Encoding and decoding errors are always detected and reported, except during lazy I/O (hGetContents, getContents, and readFile), where a decoding error merely results in termination of the character stream, as with other I/O errors.
Maybe this has something to do with your problem? If GHC has defaulted to something other than UTF-8 somewhere, or your handle has been manually set to use a different encoding, that might explain it. If you're just trying to echo text at the console, then some kind of console code-page funniness is probably going on. I know I've had similar problems in the past with other languages, like Python, when printing Unicode in a Windows console.
Try running hSetEncoding handle utf8 and see if it fixes your problem.
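As a minimal sketch of that suggestion, the helpers below pin a handle to UTF-8 before reading or writing, so the result no longer depends on the locale. The names writeUtf8File and readUtf8File are made up for this illustration; hSetEncoding, utf8, and withFile come from System.IO in base.

```haskell
import System.IO

-- Hypothetical helper: write a String to a file as UTF-8, ignoring the locale.
writeUtf8File :: FilePath -> String -> IO ()
writeUtf8File path str =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h str

-- Hypothetical helper: read a whole file as UTF-8, ignoring the locale.
readUtf8File :: FilePath -> IO String
readUtf8File path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    contents <- hGetContents h
    -- Force the lazy string before withFile closes the handle.
    length contents `seq` return contents
```

The same hSetEncoding call should apply to any text-mode Handle, including one obtained from a socket, as long as it is made before the first read or write.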
Your first example uses the standard IO library, System.IO. Operations in this library use the default system encoding (also known as localeEncoding) unless you specify otherwise. Presumably your system is set up to use UTF-8, so that is the encoding used by putStrLn, hGetContents and so on.
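To check which encoding System.IO is actually using on a given machine, something like the following sketch may help. The helper names encodingName and reportEncodings are invented here; localeEncoding and hGetEncoding are the real System.IO functions.

```haskell
import System.IO

-- Hypothetical helper: name of the encoding a handle currently uses,
-- or "binary" when the handle is in binary mode (hGetEncoding gives Nothing).
encodingName :: Handle -> IO String
encodingName h = maybe "binary" show <$> hGetEncoding h

-- Hypothetical helper: report the locale default and the encoding of stdout.
reportEncodings :: IO [String]
reportEncodings = do
  out <- encodingName stdout
  return ["locale: " ++ show localeEncoding, "stdout: " ++ out]
```

Running mapM_ putStrLn =<< reportEncodings in GHCi prints both lines; the exact names shown depend on the system's locale settings.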
Your second example uses Data.ByteString. Since this library deals in sequences of bytes only, it does no encoding or decoding. So Data.ByteString.hGetLine converts the bytes in the file directly to a ByteString.
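The difference between bytes and characters can be seen with a small sketch. It assumes the bytestring and text packages are available; the names sampleBytes, byteCount, and charCount are only for illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The question's sample text, encoded to raw UTF-8 bytes.
sampleBytes :: BS.ByteString
sampleBytes = TE.encodeUtf8 (T.pack "お待たせしました")

-- Number of bytes: ByteString never interprets them as characters.
byteCount :: Int
byteCount = BS.length sampleBytes

-- Number of characters after an explicit UTF-8 decode.
charCount :: Int
charCount = T.length (TE.decodeUtf8 sampleBytes)
```

Here byteCount is 24 and charCount is 8, because each of the eight characters takes three bytes in UTF-8. The value 8 matches what length content reported in the readFile example.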
The best way to do text I/O in general is to use the text package.
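One possible way to apply that advice to a line-based protocol is to read raw bytes and decode them to Text explicitly. This is only a sketch under the assumption that the peer always sends UTF-8; decodeLine, readUtf8Line, and readFirstUtf8Line are hypothetical names, while decodeUtf8With and lenientDecode are from the text package.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)
import System.IO (Handle, IOMode (ReadMode), withFile)

-- Hypothetical helper: decode UTF-8 bytes; invalid sequences become U+FFFD
-- instead of raising an exception.
decodeLine :: BC.ByteString -> T.Text
decodeLine = TE.decodeUtf8With lenientDecode

-- Hypothetical helper: read one raw line from a handle and decode it as UTF-8.
-- The handle's own TextEncoding plays no part, since only bytes are read.
readUtf8Line :: Handle -> IO T.Text
readUtf8Line h = decodeLine <$> BC.hGetLine h

-- Hypothetical helper: the same idea applied to the first line of a file.
readFirstUtf8Line :: FilePath -> IO T.Text
readFirstUtf8Line path = withFile path ReadMode readUtf8Line
```

In the echo server from the question, readUtf8Line could stand in for the hGetLine and toString pair. Whether lenient decoding or a strict decode that reports errors is preferable depends on the application.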