
Which XML parser for Haskell?

I'm trying to write an application that analyzes data stored in fairly large XML files (from 10 to 800 MB). Each data set is stored as a single tag, with the concrete data given as attributes. I'm currently using saxParse from HaXml, and I'm not satisfied with its memory usage. While parsing a 15 MB XML file it consumes more than 1 GB of memory, even though I tried not to store the data in lists and to process it immediately. I use the following code:

importOneFile file proc ioproc = do
  xml <- readFile file
  let (sxs, res) = saxParse file $ stripUnicodeBOM xml
  case res of
      Just str -> putStrLn $ "Error: " ++ str;
      Nothing -> forM_ sxs (ioproc . proc . (extractAttrs "row"))

where 'proc' is a procedure that converts the data from attributes into a record, and 'ioproc' is a procedure that performs some IO action: output to the screen, storing in a database, etc.

How can I decrease memory consumption during XML parsing? Would switching to another XML parser help?

Update: and which parsers support different input encodings (utf-8, utf-16, utf-32, etc.)?

If you're willing to assume that your inputs are valid, consider looking at TagSoup or Text.XML.Light from the Galois folks.

These take strings as input, so you can (indirectly) feed them anything Data.Encoding understands, namely:

  • ASCII
  • UTF8
  • UTF16
  • UTF32
  • KOI8R
  • KOI8U
  • ISO88591
  • GB18030
  • BootString
  • ISO88592
  • ISO88593
  • ISO88594
  • ISO88595
  • ISO88596
  • ISO88597
  • ISO88598
  • ISO88599
  • ISO885910
  • ISO885911
  • ISO885913
  • ISO885914
  • ISO885915
  • ISO885916
  • CP1250
  • CP1251
  • CP1252
  • CP1253
  • CP1254
  • CP1255
  • CP1256
  • CP1257
  • CP1258
  • MacOSRoman
  • JISX0201
  • JISX0208
  • ISO2022JP
  • JISX0212
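
As a rough sketch of the "decode first, then parse the string" idea, the snippet below reads a file with an explicitly chosen encoding and pulls the attributes out of every opening tag with a given name using TagSoup. Note the assumptions: it uses the encodings built into base's System.IO (utf8, utf16, utf32, latin1, or anything mkTextEncoding accepts) rather than Data.Encoding, it needs the tagsoup package, and the helper names readFileWith and rowAttrs are made up for illustration.

```haskell
import System.IO
import Text.HTML.TagSoup (Tag (TagOpen), parseTags)

-- Read a file lazily as a String, decoding it with the given encoding
-- (for example utf8, utf16 or utf32 from System.IO).
-- The handle is closed automatically once the contents are fully consumed.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  hGetContents h

-- Attribute lists of every opening tag with the given name, in document order.
-- parseTags does not validate the input; it just tokenizes it.
rowAttrs :: String -> String -> [[(String, String)]]
rowAttrs name doc = [attrs | TagOpen n attrs <- parseTags doc, n == name]
```

Something like `readFileWith utf16 file >>= mapM_ (ioproc . proc) . rowAttrs "row"` would then play the role of importOneFile from the question, assuming 'proc' is adapted to take an attribute list.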

I'm no Haskell expert, but what you're running into sounds like a classic space leak (i.e., a situation in which Haskell's lazy evaluation causes it to hold on to more memory than necessary). You may be able to solve it by forcing strictness on your saxParse output.
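
To make the idea concrete, here is a minimal sketch that does not use HaXml at all. parseEvents is a hypothetical stand-in for a lazy SAX-style parser: it yields events lazily and only knows whether there was an error once the whole input has been consumed. The consumers use a strict fold and look at the error only after the events have been processed. Whether the same ordering helps with saxParse is an assumption to check with a heap profile, not something this sketch proves.

```haskell
import Control.Monad (forM_)
import Data.List (foldl')

-- Hypothetical stand-in for a lazy parser: one Int event per input line,
-- plus an error that is only known after all lines have been examined.
parseEvents :: String -> ([Int], Maybe String)
parseEvents input = go (lines input)
  where
    go [] = ([], Nothing)
    go (l : ls) = case reads l of
      [(n, "")] -> let (ns, err) = go ls in (n : ns, err)
      _         -> ([], Just ("bad line: " ++ l))

-- Strict accumulation: foldl' forces the running total at every step,
-- so no chain of unevaluated (+) thunks builds up.
sumEvents :: String -> Either String Int
sumEvents input =
  let (events, err) = parseEvents input
      total = foldl' (+) 0 events
  in total `seq` maybe (Right total) Left err

-- IO variant: run the action on each event first, report the error afterwards.
-- Inspecting the error first would force the parse to the end while the
-- event list is still referenced, keeping every event alive.
processAll :: (Int -> IO ()) -> String -> IO ()
processAll act input = do
  let (events, err) = parseEvents input
  forM_ events act
  forM_ err (\e -> putStrLn ("Error: " ++ e))
```

The trade-off of this ordering is that events preceding a malformed line are processed before the error is reported.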

There's also a good chapter on profiling and optimization in Real World Haskell.

EDIT: Found another good resource on profiling/finding bottlenecks here.
