
Inverse of `Data.Text.Encoding.decodeLatin1`?

Is there a function f :: Text -> Maybe ByteString such that, for all x:

f (decodeLatin1 x) == Just x

Note, decodeLatin1 has the signature:

decodeLatin1 :: ByteString -> Text

I'm concerned that encodeUtf8 is not what I want. I'm guessing it just dumps the text out as UTF-8 bytes, rather than reversing what decodeLatin1 did on the way in to bytes in the upper half of the range (0x80 to 0xFF).
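A minimal sketch of that concern, assuming only the text and bytestring packages (the names original and viaUtf8 are made up for illustration). The byte 0xE9 decodes to the character U+00E9, which UTF-8 then encodes as two bytes, so the round trip through encodeUtf8 does not return the input:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

-- "café" in Latin-1: the final byte 0xE9 is in the upper half of the byte range.
original :: B.ByteString
original = B.pack [0x63, 0x61, 0x66, 0xE9]

-- Decode as Latin-1, then encode as UTF-8 (not the inverse operation).
viaUtf8 :: B.ByteString
viaUtf8 = TE.encodeUtf8 (TE.decodeLatin1 original)

-- B.unpack viaUtf8 == [0x63, 0x61, 0x66, 0xC3, 0xA9], one byte longer than original
```

ASCII-only input survives this round trip unchanged; only bytes 0x80 and above are affected.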

I understand that f has to return a Maybe, because in general there are Unicode characters that aren't in the Latin-1 character set, but I just want this to round-trip at least: if we start with a ByteString, we should get the same ByteString back.

DISCLAIMER: consider this a long comment rather than a solution, because I haven't tested it.

I think you can do it with the witch library. It is a general-purpose type conversion library with a fair amount of type safety. It has a type class called TryFrom for conversions between types that might fail.

Luckily witch provides conversions to and from encodings too, with an instance TryFrom Text (ISO_8859_1 ByteString), meaning that you can convert between Text and a Latin-1 encoded ByteString. So I think (not tested!!) this should work:

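{-# LANGUAGE TypeApplications #-}

import Data.ByteString (ByteString)
import Data.Tagged (Tagged(unTagged))
import Data.Text (Text)
import Witch (tryInto, ISO_8859_1)

-- Nothing when the Text contains a character outside Latin-1.
f :: Text -> Maybe ByteString
f s = case tryInto @(ISO_8859_1 ByteString) s of
  Left _   -> Nothing
  Right bs -> Just (unTagged bs)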

Notice that tryInto returns an Either whose Left case carries a TryFromException, so if you want to handle errors you can keep the Either instead of collapsing it to a Maybe. Up to you.

Also, the witch docs point out that this conversion goes through the String type, so there is probably an out-of-the-box solution that doesn't need the witch package. I don't know of one, and looking at the source code hasn't helped.

Edit:

Having read the witch source code, apparently this should work:

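import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C
import Data.Char (isLatin1)
import Data.Text (Text)
import qualified Data.Text as T

-- C.pack keeps only the low 8 bits of each Char, so check isLatin1 first.
f :: Text -> Maybe ByteString
f t = if allCharsAreLatin then Just (C.pack str) else Nothing
  where
    str = T.unpack t
    allCharsAreLatin = all isLatin1 str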
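A quick way to check the property the question asks for, sketched under the assumption that f is defined as above (it is repeated here, written with guards, so the snippet stands alone). The names allBytes, roundTrips and rejectsNonLatin1 are made up for illustration:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import Data.Char (isLatin1)
import qualified Data.Text as T
import Data.Text.Encoding (decodeLatin1)

-- Same logic as the answer's f, repeated so the sketch is self-contained.
f :: T.Text -> Maybe B.ByteString
f t
  | all isLatin1 str = Just (C.pack str)
  | otherwise        = Nothing
  where
    str = T.unpack t

-- Every possible byte value, 0x00 through 0xFF.
allBytes :: B.ByteString
allBytes = B.pack [minBound .. maxBound]

-- The round-trip property from the question, checked on all 256 byte values.
roundTrips :: Bool
roundTrips = f (decodeLatin1 allBytes) == Just allBytes

-- The euro sign (U+20AC) is outside Latin-1, so encoding should fail.
rejectsNonLatin1 :: Bool
rejectsNonLatin1 = f (T.pack "\x20AC") == Nothing
```

Because Latin-1 treats each byte independently, covering all 256 byte values once is a reasonable sanity check for the round trip.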

The latin1 encoding is pretty damn simple -- codepoint X maps to byte X, whenever that's in range of a byte. So just unpack and repack immediately.

import Control.Monad (guard)
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as BS

-- '\256' is a decimal escape (U+0100), so the guard admits only code points 0 to 255.
latin1EncodeText :: T.Text -> Maybe BS.ByteString
latin1EncodeText t = BS.pack (T.unpack t) <$ guard (T.all (<'\256') t)

It's possible to avoid the intermediate String , but you should probably make sure this is your bottleneck before trying for that.
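One possible way to skip the String, offered as an untuned sketch rather than a benchmarked improvement: build the ByteString with unfoldrN from bytestring, pulling one Char at a time out of the Text with T.uncons. The name latin1EncodeDirect is hypothetical.

```haskell
import Control.Monad (guard)
import qualified Data.ByteString as B
import Data.Char (ord)
import qualified Data.Text as T
import Data.Word (Word8)

-- Hypothetical String-free variant: Nothing if any character is above U+00FF.
latin1EncodeDirect :: T.Text -> Maybe B.ByteString
latin1EncodeDirect t = do
  guard (T.all (< '\256') t)
  -- Each step emits one byte and hands back the rest of the Text as the new seed.
  let step s = do
        (c, rest) <- T.uncons s
        pure (fromIntegral (ord c) :: Word8, rest)
  -- T.length t is both the character count and the exact output size in bytes.
  pure (fst (B.unfoldrN (T.length t) step t))
```

This still walks the Text more than once (the guard, T.length, and the unfold), so it is worth measuring against the simpler version before adopting it.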
