
Haskell source encoding

The Haskell 2010 Language Report says:

Haskell uses the Unicode [2] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.

Does this mean UTF-8?

In ghc-7.0.4/compiler/parser/Lexer.x.source:

$unispace    = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
$whitechar   = [\ \n\r\f\v $unispace]
$white_no_nl = $whitechar # \n
$tab         = \t

$ascdigit  = 0-9
$unidigit  = \x03 -- Trick Alex into handling Unicode. See alexGetChar.
$decdigit  = $ascdigit -- for now, should really be $digit (ToDo)
$digit     = [$ascdigit $unidigit]

$special   = [\(\)\,\;\[\]\`\{\}]
$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\\^\|\-\~]
$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetChar.
$symbol    = [$ascsymbol $unisymbol] # [$special \_\:\"\']

$unilarge  = \x01 -- Trick Alex into handling Unicode. See alexGetChar.
$asclarge  = [A-Z]
$large     = [$asclarge $unilarge]

$unismall  = \x02 -- Trick Alex into handling Unicode. See alexGetChar.
$ascsmall  = [a-z]
$small     = [$ascsmall $unismall \_]

$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetChar.
$graphic   = [$small $large $symbol $digit $special $unigraphic \:\"\']

...I'm not sure what to make of this. alexGetChar wasn't really helpful.

Unicode is a character set. UTF-8, UTF-16, etc. are concrete physical encodings of Unicode code points. Try reading here; the difference is explained pretty well there.

The cited part of the report just states that Haskell sources use the Unicode character set. It doesn't state which encoding should be used at all. In other words, it says which characters may appear in the sources, but not how they are written in terms of plain bytes.
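To make the character-set/encoding distinction concrete, here is a minimal sketch that encodes a single code point by hand, once as UTF-8 and once as big-endian UTF-16. The function names utf8Bytes and utf16beBytes are made up for this illustration, and surrogate code points are not treated specially; a real program would use a library encoder instead.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Truncate an Int that is already known to fit in one byte.
byte :: Int -> Word8
byte = fromIntegral

-- | Encode one code point as UTF-8 (hypothetical helper, illustration only).
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont n]
  | n < 0x10000 = [byte (0xE0 .|. (n `shiftR` 12)), cont (n `shiftR` 6), cont n]
  | otherwise   = [ byte (0xF0 .|. (n `shiftR` 18))
                  , cont (n `shiftR` 12), cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont x = byte (0x80 .|. (x .&. 0x3F))

-- | Encode one code point as big-endian UTF-16 (hypothetical helper).
utf16beBytes :: Char -> [Word8]
utf16beBytes c
  | n < 0x10000 = unit n
  | otherwise   = unit hi ++ unit lo   -- surrogate pair
  where
    n  = ord c
    m  = n - 0x10000
    hi = 0xD800 .|. (m `shiftR` 10)
    lo = 0xDC00 .|. (m .&. 0x3FF)
    -- Split one 16-bit code unit into high byte, low byte.
    unit u = [byte (u `shiftR` 8), byte (u .&. 0xFF)]
```

The same abstract character, U+03BB (Greek small lambda), comes out as the two bytes CE BB under utf8Bytes and as 03 BB under utf16beBytes. The report talks only about the code point, not about either byte sequence.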

There was a proposal to adopt UTF-8 as the standard encoding of Haskell source files, but I'm not sure whether it was accepted.

In practice, GHC assumes all input files are UTF-8, but it ignores malformed byte sequences in comments.
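As a small illustration of what that means for everyday code, the following sketch uses non-ASCII characters in an identifier, an operator and a string literal. It assumes the file is saved as UTF-8; all names in it are invented for the example.

```haskell
-- Save this file as UTF-8. All names here are made up for illustration.

-- A non-ASCII lowercase letter can be used as a variable name.
π :: Double
π = 3.141592653589793

-- A non-ASCII symbol character can be used as an operator.
(∘) :: (b -> c) -> (a -> b) -> a -> c
f ∘ g = \x -> f (g x)

-- Non-ASCII characters may appear directly in string literals.
greeting :: String
greeting = "Grüß Gott, λ-calculus"

-- Ordinary code can mix the Unicode names with ASCII ones.
circleArea :: Double -> Double
circleArea r = π * r * r
```

If the same text were saved in another encoding such as Latin-1 or UTF-16, GHC would not see these characters as intended.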

The Haskell standard simply says Unicode, i.e. the set of possible characters (as opposed to, e.g., ASCII or Latin-1); it doesn't specify which of the several different encodings (UTF-8, UTF-16, UTF-32, and their byte orders) to use.

Alex, the lexer generator that comes with the Haskell Platform, requires its input to be UTF-8 encoded *, which is why you see the code you mention. In practice I think all the major implementations of Haskell require source to be in UTF-8.

* - This is actually a real problem, as GHC stores strings and, more importantly, Data.Text internally as UTF-16. It would be nice to be able to lex these directly rather than converting back and forth.
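One plausible reading of the "Trick Alex into handling Unicode" comments in the question's Lexer.x excerpt: before the generated lexer sees a character, every non-ASCII character is collapsed into one of the placeholder bytes \x01 to \x06 according to its Unicode general category, so a byte-oriented lexer only has to match those six values. The sketch below illustrates that idea. It is not GHC's actual alexGetChar; the placeholder values are taken from the excerpt, while the name placeholderByte and the exact category grouping are assumptions made for this example.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)
import Data.Word (Word8)

-- | Hypothetical classifier: the byte a byte-oriented lexer would match on.
placeholderByte :: Char -> Word8
placeholderByte c
  -- Real control characters must not collide with the placeholders.
  | c <= '\x06' = 0x00
  -- ASCII passes through unchanged.
  | c <= '\x7f' = fromIntegral (fromEnum c)
  | otherwise = case generalCategory c of
      UppercaseLetter -> 0x01  -- $unilarge
      TitlecaseLetter -> 0x01
      LowercaseLetter -> 0x02  -- $unismall
      OtherLetter     -> 0x02
      DecimalNumber   -> 0x03  -- $unidigit
      MathSymbol      -> 0x04  -- $unisymbol
      CurrencySymbol  -> 0x04
      ModifierSymbol  -> 0x04
      OtherSymbol     -> 0x04
      Space           -> 0x05  -- $unispace
      ModifierLetter  -> 0x06  -- $unigraphic
      LetterNumber    -> 0x06
      OtherNumber     -> 0x06
      _               -> 0x00  -- treated as non-graphic in this sketch
```

With this mapping, a Greek lowercase letter is seen by the lexer as \x02, so it matches $small through $unismall exactly like an ASCII lowercase letter would.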

There is an important distinction between a data type (i.e. what "abstract" data you can work with) and its representation (i.e. how it is stored in computer memory or on disk).

The Haskell Report says two things related to Unicode:

  1. That the Char data type in Haskell represents a Unicode character (also known as a code point). You should think of it as an abstract data type that provides a certain interface (e.g. you can call isDigit or toLower on it), but you are not allowed to know how exactly it is represented internally. A specific implementation of Haskell (e.g. GHC) is free to represent it in memory in whatever way it wants, and it doesn't matter at all, as you can't access the underlying raw bits anyway.

  2. That a Haskell program is text, consisting of (abstract) Unicode code points, that is, essentially, a String. It then goes on to explain how to parse this String. Once again, it is important to stress that it defines the syntax of Haskell in terms of sequences of abstract Unicode code points.
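A short sketch of what working with Char and String as abstract code points looks like. Only functions from Data.Char are used; the helper names lowerAll, codePoints and countAsciiDigits are invented for this example.

```haskell
import Data.Char (isDigit, ord, toLower)

-- | Lower-case a string one code point at a time.
lowerAll :: String -> String
lowerAll = map toLower

-- | The numeric code points of a string; no bytes are involved.
codePoints :: String -> [Int]
codePoints = map ord

-- | Count the characters '0'..'9' (Data.Char.isDigit is ASCII-only).
countAsciiDigits :: String -> Int
countAsciiDigits = length . filter isDigit
```

For instance, lowerAll maps U+039B (Greek capital lambda) to U+03BB, and a String holding the single character U+1F600 has length 1, even though that character takes four bytes in UTF-8 and two code units in UTF-16.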

Now, to your question about Haskell source code. The Haskell Report does not specify how this Unicode text is encoded into zeroes and ones when stored in a file.

In fact, the Haskell Report does not specify how Haskell programs are stored at all. It doesn't mention that Haskell source code is stored in files, that files have to be named after modules, or that the directory structure should follow the structure of module names. These are all considered compiler implementation details, and the idea is that this allows each compiler to store Haskell programs wherever and however it wants: in files, in database tables, or as JPEG photos of a blackboard with a program written on it in chalk. For this reason it does not specify the encoding either (it would make no sense to specify the encoding for a program written out on a blackboard).

However, GHC, the de-facto standard Haskell compiler, assumes that Haskell programs are stored in files encoded as UTF-8, organised hierarchically, and named after module names.
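The "named after module names" convention can be sketched as a tiny function. This is a simplification for illustration: modulePath is a hypothetical helper, it ignores search directories and literate .lhs files, and it always uses '/' as the separator.

```haskell
-- | Hypothetical helper: relative source path for a hierarchical module name.
modulePath :: String -> FilePath
modulePath name = map dotToSlash name ++ ".hs"
  where
    -- Each dot in the module name becomes a directory separator.
    dotToSlash '.' = '/'
    dotToSlash ch  = ch
```

Under this sketch, a module named Data.Map.Strict would live in Data/Map/Strict.hs relative to some source root.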
