Parse CSV/TSV file in Haskell - Unicode Characters

I'm trying to parse a tab-delimited file using cassava (Data.Csv) in Haskell. However, I run into problems if there are "strange" (Unicode) characters in my CSV file: I get a parse error (endOfInput).

According to the command-line tool "file", my file has a "UTF-8 Unicode text" encoding. My Haskell code looks like this:

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as C
import qualified System.IO.UTF8 as U
import qualified Data.ByteString.UTF8 as UB
import qualified Data.ByteString.Lazy.Char8 as DL
import qualified Codec.Binary.UTF8.String as US
import qualified Data.Text.Lazy.Encoding as EL
import qualified Data.ByteString.Lazy as L

import Data.Text.Encoding as E

-- Handle CSV / TSV files with ...
import Data.Csv
import qualified Data.Vector as V

import Data.Char -- ord

csvFile :: FilePath
csvFile = "myFile.txt"

-- Set delimiter to \t (tabulator)
myOptions = defaultDecodeOptions {
              decDelimiter = fromIntegral (ord '\t')
            }

main :: IO ()
main = do
  csvData <- L.readFile csvFile 
  case EL.decodeUtf8' csvData of 
   Left err -> print err
   Right dat ->
     case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
       Left err -> putStrLn err
       Right v -> V.forM_ v $ \ (category :: String ,
                               user :: String ,
                               date :: String,
                               time :: String,
                               message :: String) -> do
         print message

I tried using decodeUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. However, the endOfInput error persists.

My CSV file looks like this:

a   -   -   -   RT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a   -   -   -   Uhm .. wat dan ook ????!!!! 👋

Or more literally:

a\t-\t-\t-\tRT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a\t-\t-\t-\tUhm .. wat dan ook ????!!!! 👋

The problem characters are the 👋 and • (and in my complete file, there are many more similar characters). What can I do so that cassava / Data.Csv can read my file properly?

EDIT: I've created the following preprocessor for escaping my Text before decoding it with cassava (see tibbe's answer). There's probably a better way, but so far it works fine!

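{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T

-- Wrap the whole input in double quotes; escaper closes and reopens
-- the quotes around every delimiter, so each field ends up quoted.
preprocess :: T.Text -> T.Text
preprocess txt = T.cons '\"' $ T.snoc escaped '\"'
  where escaped = T.concatMap escaper txt

escaper :: Char -> T.Text
escaper c
  | c == '\t' = "\"\t\""   -- field separator: close quote, tab, open quote
  | c == '\n' = "\"\n\""   -- record separator: close quote, newline, open quote
  | c == '\"' = "\"\""     -- embedded double quote: double it
  | otherwise = T.singleton c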

Per the cassava documentation:

  • Non-escaped fields may contain any characters except double-quotes, commas, carriage returns, and newlines.

  • Escaped fields may contain any characters (but double-quotes need to be escaped).

Since the last field in your first record contains double quotes, the field needs to be enclosed in double quotes and any double quotes inside it need to be escaped by doubling them, like so:

a   -   -   -   "RT USE "" Kenny"" • Hahahahahahahahaha. #Emmen #Brandstapel"
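
The quoting rule above can be applied mechanically. The following is a minimal sketch using only Data.Text; quoteField and quoteTsv are hypothetical helper names (not part of cassava), and the sketch assumes LF line endings and that no field contains a raw tab or newline.

```haskell
import qualified Data.Text as T

-- | Wrap one field in double quotes and double any embedded double quotes.
-- (Hypothetical helper; not part of cassava.)
quoteField :: T.Text -> T.Text
quoteField field =
  T.concat [quote, T.replace quote (T.pack "\"\"") field, quote]
  where
    quote = T.pack "\""

-- | Quote every tab-separated field on every line of the input.
-- Assumes LF line endings and no raw tabs or newlines inside fields.
-- Blank lines are passed through unchanged.
quoteTsv :: T.Text -> T.Text
quoteTsv = T.unlines . map quoteLine . T.lines
  where
    tab = T.pack "\t"
    quoteLine line
      | T.null line = line
      | otherwise   = T.intercalate tab (map quoteField (T.splitOn tab line))
```

The quoted text could then be encoded with encodeUtf8 and passed to cassava's decodeWith with the tab delimiter option.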

This code works for me:

{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString.Lazy (fromStrict)
import Data.Char (ord)
import Data.Csv
import Data.Text.Encoding (encodeUtf8)
import Data.Vector (Vector)

-- Decode a single tab-separated record whose last field is quoted.
test :: Either String (Vector (String, String, String, String, String))
test = decodeWith
    defaultDecodeOptions {decDelimiter = fromIntegral $ ord '\t' }
    NoHeader
    (fromStrict $ encodeUtf8 "a\t-\t-\t-\t\"RT USE \"\" Kenny\"\" • Hahahahahahahahaha. #Emmen #Brandstapel\"")

(Note that I had to make sure to use encodeUtf8 on a literal of type Text rather than just using a ByteString literal directly. The IsString instance for ByteString, which is what's used to convert the literal to a ByteString, truncates each Unicode code point to 8 bits.)
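
To make the truncation concrete, here is a small sketch that turns the bullet character U+2022 into bytes in both ways. It assumes the bytestring and text packages; bullet, viaPack and viaUtf8 are illustrative names. Data.ByteString.Char8.pack stands in for a ByteString literal here, since it truncates each character to 8 bits just as the IsString instance does.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Data.Word (Word8)

-- | The bullet character from the question, U+2022.
bullet :: String
bullet = "\x2022"

-- | Bytes produced by Char8.pack, which keeps only the low 8 bits of each Char.
viaPack :: [Word8]
viaPack = B.unpack (BC.pack bullet)

-- | Bytes produced by going through Text and encoding it as UTF-8.
viaUtf8 :: [Word8]
viaUtf8 = B.unpack (encodeUtf8 (T.pack bullet))
```

Here viaPack is [0x22], the byte of an ASCII double quote, while viaUtf8 is the three-byte sequence [0xE2, 0x80, 0xA2]. A truncated bullet therefore not only loses information but also puts a stray double quote into the input.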
