How to handle a list to CSV record field in Haskell

The goal is to read in a text file and convert it to a CSV. Every line of the input text file has certain fields, but other fields appear zero, one, or more times. The problem is how to handle those fields with varying numbers of items.

Example of the problem:

I can parse the text file to get a list of "events" that might look like the following, where each of the data constructors is for a particular type, TrialEvents:

For instance,

trialRecord1 = [ Trial {time = 123, trialNum = 1}
               , Efix {eye ='R' ,start = 123, stop = 234, x = 222, y = 123}
               , RewStart {time = 234}, RewEnd {time = 345} ]

trialRecord2 = [ Trial {time = 123, trialNum = 1}
               , RewStart {time = 234}, RewEnd {time = 345} ]

trialRecord3 = [ Trial {time = 123, trialNum = 1}
               , Efix {eye ='R' ,start = 123, stop = 234, x = 222, y = 123}
               , Efix {eye ='R' ,start = 223, stop = 334, x = 100, y = 222}
               , RewStart {time = 234}, RewEnd {time = 345} ]

These lists of events, one for each trial, will always have a trial number and time, but may have 0, 1, or more of other values, such as Efix in this example.

My confusion is about how to generate a CSV file from such data where I can have column heads such as trialTime, trialNumber, fixationStartTime, ..., rewStartTime, and RewEndTime. To be able to write the fixationStartTimes, I thought I could use a list that could be empty, have one value, or have multiple values.

But when using the cassava package and writing my own instance of ToField (just testing with a list of integers) with pack $ show, I noticed that I get quotes and escape characters when the list has more than one element, but not for lists of length one or zero.

"123,234,\"[1,2]\",345,456\r\n"
"123,234,[1],345,456\r\n"

This will present problems when I try to read the CSV file into an analysis program, because I will have to handle these varying cases.

Can anyone advise how I might approach this issue of a variable-length list in a conversion to CSV, and how to make it maximally friendly for reading the CSV into another environment like R?

Thanks.

"Can anyone advise how I might approach this issue of a variable length list in a conversion to csv and how to make it maximally friendly on reading the CSV into another environment like R?"

Since this seems to be the core question, and the remainder of the post seems to be an XY problem that assumes CSV as the preferred format, here's an isolated answer to that:

How about using JSON instead? You can write your data type to a JSON structure that is isomorphic to the one you've already got, and R supports JSON through jsonlite. Then you can have your variable-length lists in R without needing to encode them into a column layout (and back again?).
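To illustrate what "isomorphic" could mean here, the following is a minimal sketch that renders the question's flat event list as JSON text by hand, using only base. The helper names object, eventToJson and recordToJson, and the "type" tag key, are assumptions made for this example; in practice a JSON library such as aeson would be the usual choice.

```haskell
import Data.List (intercalate)

-- The flat event type from the question (field names as in the post).
data Event
  = Trial    { time :: Int, trialNum :: Int }
  | Efix     { eye :: Char, start :: Int, stop :: Int, x :: Int, y :: Int }
  | RewStart { time :: Int }
  | RewEnd   { time :: Int }

-- Render a JSON object from (key, already-rendered value) pairs.
-- 'show' on a String stands in for JSON string quoting; that is only
-- safe here because all keys and tags are plain ASCII without escapes.
object :: [(String, String)] -> String
object kvs = "{" ++ intercalate "," [show k ++ ":" ++ v | (k, v) <- kvs] ++ "}"

-- One JSON object per event, tagged with the constructor name.
eventToJson :: Event -> String
eventToJson (Trial t n) =
  object [("type", show "Trial"), ("time", show t), ("trialNum", show n)]
eventToJson (Efix e s p px py) =
  object [ ("type", show "Efix"), ("eye", show [e]), ("start", show s)
         , ("stop", show p), ("x", show px), ("y", show py) ]
eventToJson (RewStart t) = object [("type", show "RewStart"), ("time", show t)]
eventToJson (RewEnd t)   = object [("type", show "RewEnd"), ("time", show t)]

-- A trial record (a list of events) becomes a JSON array.
recordToJson :: [Event] -> String
recordToJson evs = "[" ++ intercalate "," (map eventToJson evs) ++ "]"
```

A record with zero, one, or many Efix events produces an array of the matching length, so no padding or column layout is needed.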


If, however, you'd prefer to have that column layout, here's an answer to that:

"events will always have a trial number and time, but may have 0, 1, or more of other values, such as Efix [, RewStart and RewEnd [?]] in this example."

Then having

data Event = Trial { time :: Int, trialNum :: Int }
           | Efix { eye :: Char, start :: Int, stop :: Int, x :: Int, y :: Int }
           | RewStart { time :: Int }
           | RewEnd { time :: Int }

type Events = [Event]

does not seem to quite model what you're saying. How about, instead,

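-- DuplicateRecordFields is needed because 'time' is a field of both Event and Point.
{-# LANGUAGE DuplicateRecordFields #-}

data Event p = Trial { time :: Int, trialNum :: Int, points :: [p] }
data Point = Efix { eye :: Char, start :: Int, stop :: Int, x :: Int, y :: Int }
           | RewStart { time :: Int }
           | RewEnd { time :: Int }

type Events = [Event Point]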

Then your records would look like

trialRecord1 = Trial { time = 123, trialNum = 1, points =
                 [ Efix { eye = 'R', start = 123, stop = 234, x = 222, y = 123 }
                 , RewStart { time = 234 }
                 , RewEnd { time = 345 } ] }

trialRecord2 = Trial { time = 123, trialNum = 1, points =
                 [ RewStart { time = 234 }
                 , RewEnd { time = 345 } ] }

trialRecord3 = Trial { time = 123, trialNum = 1, points =
                 [ Efix { eye = 'R', start = 123, stop = 234, x = 222, y = 123 }
                 , Efix { eye = 'R', start = 223, stop = 334, x = 100, y = 222 }
                 , RewStart { time = 234 }
                 , RewEnd { time = 345 } ] }

"How do I generate a csv file from such data where I can have column heads such as trialTime, trialNum, fixationStartTime, ..., rewStartTime, and rewEndTime?"

Since you can only be sure of trialTime and trialNum, those are the only two columns you can hardcode. The rest of the columns have to be laid out according to which points are present across all the events, leaving cells empty where a trial has fewer points. For example, rendering trialRecord1, trialRecord2 and trialRecord3 in a table layout should (probably?) give something like

+-----------+-----------+----------+--------------+--------------------+-------------------+------------+------------+--------------+--------------------+-------------------+------------+------------+---------------+-------------+
| recordNum | trialTime | trialNum | fixationEye1 | fixationStartTime1 | fixationStopTime1 | fixationX1 | fixationY1 | fixationEye2 | fixationStartTime2 | fixationStopTime2 | fixationX2 | fixationY2 | rewStartTime1 | rewEndTime1 |
+-----------+-----------+----------+--------------+--------------------+-------------------+------------+------------+--------------+--------------------+-------------------+------------+------------+---------------+-------------+
|         1 |       123 |        1 |            R |                123 |               234 |        222 |        123 |              |                    |                   |            |            |           234 |         345 |
|         2 |       123 |        1 |              |                    |                   |            |            |              |                    |                   |            |            |           234 |         345 |
|         3 |       123 |        1 |            R |                123 |               234 |        222 |        123 |            R |                223 |               334 |        100 |        222 |           234 |         345 |
+-----------+-----------+----------+--------------+--------------------+-------------------+------------+------------+--------------+--------------------+-------------------+------------+------------+---------------+-------------+
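The column names in a table like this one follow mechanically from how many slots each kind of point needs. As a rough sketch, assuming the maximum counts per kind are already known, a hypothetical header function (not part of the original answer) could build them like this:

```haskell
-- Column names for a table with room for nFix fixations, nStart reward
-- starts and nEnd reward ends per trial. The names and their order
-- follow the table above; the function itself is only an illustration.
header :: Int -> Int -> Int -> [String]
header nFix nStart nEnd =
  ["recordNum", "trialTime", "trialNum"]
    ++ concat [ map (++ show i) fixCols | i <- [1 .. nFix] ]
    ++ [ "rewStartTime" ++ show i | i <- [1 .. nStart] ]
    ++ [ "rewEndTime" ++ show i | i <- [1 .. nEnd] ]
  where
    -- The five columns contributed by every Efix slot.
    fixCols =
      [ "fixationEye", "fixationStartTime", "fixationStopTime"
      , "fixationX", "fixationY" ]
```

For the three example records the counts are two fixations, one reward start and one reward end, so header 2 1 1 yields the fifteen column names shown in the table.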

You could write a function align :: [Event Point] -> [Event (Maybe Point)] that inserts Nothings in case of missing data points. (A Nothing may correspond to a variable number of columns depending on which Point is being translated to columns, so you could also consider a function of type [Event Point] -> [Event (Either NumEmptyColumns Point)], where type NumEmptyColumns = Int.)

Running align [ trialRecord1, trialRecord2, trialRecord3 ] could then give the value

[ Trial { time = 123, trialNum = 1, points =
    [ Just $ Efix { eye ='R', start = 123, stop = 234, x = 222, y = 123 }
    , Nothing
    , Just $ RewStart { time = 234 }
    , Just $ RewEnd { time = 345 } ] }

, Trial { time = 123, trialNum = 1, points =
    [ Nothing
    , Nothing
    , Just $ RewStart { time = 234 }
    , Just $ RewEnd { time = 345 } ] }

, Trial { time = 123, trialNum = 1, points =
    [ Just $ Efix { eye ='R', start = 123, stop = 234, x = 222, y = 123 }
    , Just $ Efix { eye ='R', start = 223, stop = 334, x = 100, y = 222 }
    , Just $ RewStart { time = 234 }
    , Just $ RewEnd { time = 345 } ] }
]
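One possible way to write such an align is sketched below. It rests on an assumption the answer does not spell out: every kind of point gets as many slots as the largest number of points of that kind in any trial, and the kinds are emitted in the table's column order. The Kind type and the kindOf helper are made up for this sketch.

```haskell
{-# LANGUAGE DuplicateRecordFields #-}

-- Types as in the answer; the pragma is needed because 'time' is a
-- field of both Event and Point.
data Event p = Trial { time :: Int, trialNum :: Int, points :: [p] }
  deriving (Eq, Show)

data Point
  = Efix { eye :: Char, start :: Int, stop :: Int, x :: Int, y :: Int }
  | RewStart { time :: Int }
  | RewEnd { time :: Int }
  deriving (Eq, Show)

-- The three kinds of point, in the column order used by the table.
data Kind = KFix | KStart | KEnd deriving (Eq, Enum, Bounded)

kindOf :: Point -> Kind
kindOf Efix{}     = KFix
kindOf RewStart{} = KStart
kindOf RewEnd{}   = KEnd

-- Pad every trial so that each kind of point occupies the same number
-- of slots in every trial: the maximum count of that kind over all trials.
align :: [Event Point] -> [Event (Maybe Point)]
align trials = map pad trials
  where
    -- All points of kind k in one trial, in their original order.
    ofKind k (Trial _ _ ps) = filter ((== k) . kindOf) ps
    -- Number of slots reserved for kind k (0 if it never occurs).
    slots k = maximum (0 : map (length . ofKind k) trials)
    pad t@(Trial tm n _) =
      Trial tm n (concatMap (padKind t) [minBound .. maxBound])
    -- The trial's points of kind k, followed by Nothing up to the slot count.
    padKind t k = take (slots k) (map Just (ofKind k t) ++ repeat Nothing)
```

Note that this groups points by kind rather than keeping them in time order, which matches the table layout but may not suit data where the interleaving of events matters.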

Turning this regular (non-jagged) list of lists into a regular CSV should be more straightforward.
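As a rough sketch of that last step, the code below renders already-aligned trials as CSV lines, using the Either NumEmptyColumns Point variant so that a gap knows how many empty cells it covers. To keep it short it uses positional versions of the types instead of records, and the names cells, toRow and toRows are assumptions for this example. It builds plain strings with base only; a real program would more likely hand the cells to cassava.

```haskell
import Data.List (intercalate)

-- Positional versions of the answer's types, to keep the sketch short.
data Event p = Trial Int Int [p]
data Point = Efix Char Int Int Int Int | RewStart Int | RewEnd Int

type NumEmptyColumns = Int

-- The cells contributed by one slot: a point's fields, or n empty cells.
cells :: Either NumEmptyColumns Point -> [String]
cells (Left n)                   = replicate n ""
cells (Right (Efix e s p px py)) = [e] : map show [s, p, px, py]
cells (Right (RewStart t))       = [show t]
cells (Right (RewEnd t))         = [show t]

-- One CSV line for an aligned trial, prefixed with its record number.
-- No quoting is done: this assumes every cell is a number, a single
-- letter, or empty, which holds for the example data.
toRow :: Int -> Event (Either NumEmptyColumns Point) -> String
toRow recordNum (Trial t n ps) =
  intercalate "," (show recordNum : show t : show n : concatMap cells ps)

-- Number the trials from 1 and render each as a line.
toRows :: [Event (Either NumEmptyColumns Point)] -> [String]
toRows = zipWith toRow [1 ..]
```

An empty Efix slot is written as Left 5 and an empty reward slot as Left 1, so every row ends up with the same number of cells.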

"with pack $ show I noticed I get quotes and escape characters when the list is greater than length 1"

"123,234,\"[1,2]\",345,456\r\n"

As @DarthFennec says, this is because the value [1,2] contains a comma, which is a special character in CSV. The only escaping in the CSV itself is the pair of " characters around the field; the \ characters are Haskell escape codes for showing a string that contains a quote:

GHCi> putStrLn "123,234,\"[1,2]\",345,456\r\n"
123,234,"[1,2]",345,456

This is what the string actually looks like.
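The difference between [1,2] and [1] is consistent with the usual CSV quoting rule: a field is wrapped in double quotes only when it contains a special character. The sketch below implements that rule as described in RFC 4180, using base only. The name quoteField is hypothetical, and this is not cassava's actual implementation, whose exact conditions may differ.

```haskell
-- Quote a field the way RFC 4180 describes: wrap it in double quotes
-- when it contains a comma, a double quote, CR or LF, and double any
-- embedded double quotes. Other fields are left untouched.
quoteField :: String -> String
quoteField s
  | any (`elem` ",\"\r\n") s = "\"" ++ concatMap escape s ++ "\""
  | otherwise                = s
  where
    escape '"' = "\"\""
    escape c   = [c]
```

Under this rule, show [1,2] contains a comma and gets quoted, while show [1] and show [] contain none and pass through unchanged.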

But having multiple values in a Haskell-syntax list literal in a CSV file is probably not "maximally friendly". If you're doing that, then perhaps JSON is a better alternative.
