
Haskell: Scan Through a List and Apply a Different Function for Each Element

I need to scan through a document and accumulate the output of different functions for each string in the file. The function run on any given line of the file depends on what is in that line.

I could do this very inefficiently by making a complete pass through the file for every list I want to collect. Example pseudo-code:

at :: B.ByteString -> Maybe Atom
at line
    | line == ATOM record = do stuff to return Just Atom
    | otherwise = Nothing

ot :: B.ByteString -> Maybe Sheet
ot line
    | line == SHEET record = do other stuff to return Just Sheet
    | otherwise = Nothing

Then I would map each of these functions over the entire list of lines in the file to get a complete list of Atoms and Sheets:

mapper :: [B.ByteString] -> IO ()
mapper lines = do
    let atoms = mapMaybe at lines
    let sheets = mapMaybe ot lines
    -- Do stuff with my atoms and sheets

However, this is inefficient because I am mapping over the entire list of strings once for every list I am trying to create. Instead, I want to traverse the list of line strings only once, identify each line as I move through it, and then apply the appropriate function and store the results in different lists.

My C mentality wants to do this (pseudo-code):

mapper' :: [B.ByteString] -> IO ()
mapper' lines = do
    let atoms = []
    let sheets = []
    for line in lines:
        | line == ATOM record = (atoms = atoms ++ at line)
        | line == SHEET record = (sheets = sheets ++ ot line)
    -- Now 'atoms' is a complete list of all the ATOM records
    --  and 'sheets' is a complete list of all the SHEET records

What is the Haskell way of doing this? I simply can't get my functional-programming mindset to come up with a solution.

First of all, I think that the answers others have supplied will work at least 95% of the time. It's always good practice to code for the problem at hand by using appropriate data types (or tuples in some cases). However, sometimes you really don't know in advance what you're looking for in the list, and in these cases trying to enumerate all possibilities is difficult, time-consuming and error-prone. Or you're writing multiple variants of the same sort of thing (manually inlining multiple folds into one) and you'd like to capture the abstraction.

Fortunately, there are a few techniques that can help.

The framework solution

(somewhat self-evangelizing)

First, the various "iteratee/enumerator" packages often provide functions to deal with this sort of problem. I'm most familiar with iteratee, which would let you do the following:

import qualified Data.ByteString.Char8 as B
import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Maybe

-- first, you'll need some way to process the Atoms/Sheets/etc. you're getting
-- if you want to just return them as a list, you can use the built-in
-- stream2list function

-- next, create stream transformers
-- given at :: B.ByteString -> Maybe Atom
-- create a stream transformer from ByteString lines to Atoms
atIter :: Monad m => Enumeratee [B.ByteString] [Atom] m a
atIter = I.mapChunks (catMaybes . map at)

otIter :: Monad m => Enumeratee [B.ByteString] [Sheet] m a
otIter = I.mapChunks (catMaybes . map ot)

-- finally, combine multiple processors into one
-- if you have more than one processor, you can use zip3, zip4, etc.
procFile :: Monad m => Iteratee [B.ByteString] m ([Atom],[Sheet])
procFile = I.zip (atIter =$ stream2list) (otIter =$ stream2list)

-- and run it on some data
runner :: FilePath -> IO ([Atom],[Sheet])
runner filename = do
  resultIter <- enumFile defaultBufSize filename $= enumLinesBS $ procFile
  run resultIter

One benefit this gives you is extra composability. You can create transformers as you like, and just combine them with zip. You can even run the consumers in parallel if you like (although only if you're working in the IO monad, and it's probably not worth it unless the consumers do a lot of work) by changing to this:

import Data.Iteratee.Parallel

parProcFile = I.zip (parI $ atIter =$ stream2list) (parI $ otIter =$ stream2list)

The result of doing so isn't the same as a single for-loop: this will still perform multiple traversals of the data. However, the traversal pattern has changed. This will load a certain amount of data at once (defaultBufSize bytes) and traverse that chunk multiple times, storing partial results as necessary. After a chunk has been entirely consumed, the next chunk is loaded and the old one can be garbage collected.

Hopefully this will demonstrate the difference:

Data.List.zip:
  x1 x2 x3 .. x_n
                   x1 x2 x3 .. x_n

Data.Iteratee.zip:
  x1 x2      x3 x4      x_n-1 x_n
       x1 x2      x3 x4           x_n-1 x_n

If you're doing enough work that parallelism makes sense, this isn't a problem at all. Due to memory locality, the performance is much better than multiple traversals over the entire input, as Data.List.zip would make.
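The traversal pattern in the diagram can be imitated with plain lists. The following is only an illustrative sketch, not how iteratee is implemented: atomOf, sheetOf, chunks and processChunked are hypothetical names, and lines are modelled as String rather than ByteString.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (mapMaybe)

-- Hypothetical parsers standing in for 'at' and 'ot' from the question.
atomOf, sheetOf :: String -> Maybe String
atomOf  = stripPrefix "ATOM "
sheetOf = stripPrefix "SHEET "

-- Split a list into chunks of at most n elements.
-- Chunk sizes below 1 are treated as 1 so the recursion always advances.
chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks n xs = let (h, t) = splitAt (max 1 n) xs in h : chunks n t

-- Every chunk is traversed twice (once per parser), but the input as a
-- whole is walked only once, chunk by chunk.
processChunked :: Int -> [String] -> ([String], [String])
processChunked n = foldMap perChunk . chunks n
  where
    perChunk c = (mapMaybe atomOf c, mapMaybe sheetOf c)
```

The partial results of each chunk are appended through the Monoid instance for pairs of lists, so the output order matches the input order.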

The beautiful solution

If a single-traversal solution really does make the most sense, you might be interested in Max Rabkin's Beautiful Folding post, and Conal Elliott's followup work (this too). The essential idea is that you can create data structures to represent folds and zips, and combining these lets you create a new, combined fold/zip function that only needs one traversal. It's maybe a little advanced for a Haskell beginner, but since you're thinking about the problem you may find it interesting or useful. Max's post is probably the best starting point.

I show a solution for two types of line, but it is easily extended to five types of line by using a five-tuple instead of a two-tuple.
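To make the idea of "folds as data" concrete, here is a minimal sketch in the spirit of those posts. It is not code taken from them: the names Fold, runFold, collect, atomOf and sheetOf are assumptions chosen for illustration, and lines are modelled as String.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (foldl', stripPrefix)

-- A left fold packaged as data: a step function, an initial accumulator,
-- and a final projection. The accumulator type x is hidden.
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

instance Functor (Fold a) where
  fmap f (Fold step begin done) = Fold step begin (f . done)

-- Combining two folds pairs their accumulators, so both advance together
-- on every input element. (A production version would use a strict pair.)
instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (const b)
  (Fold stepL beginL doneL) <*> (Fold stepR beginR doneR) =
    Fold step (beginL, beginR) done
    where
      step (l, r) a = (stepL l a, stepR r a)
      done (l, r)   = doneL l (doneR r)

-- Run a fold with a single traversal of the list.
runFold :: Fold a b -> [a] -> b
runFold (Fold step begin done) = done . foldl' step begin

-- Collect the successful results of a Maybe-returning function, in input
-- order, using a difference list as the accumulator.
collect :: (a -> Maybe b) -> Fold a [b]
collect f = Fold step id ($ [])
  where
    step acc a = maybe acc (\b -> acc . (b :)) (f a)

-- Hypothetical parsers standing in for 'at' and 'ot' from the question.
atomOf, sheetOf :: String -> Maybe String
atomOf  = stripPrefix "ATOM "
sheetOf = stripPrefix "SHEET "

-- Two independent folds combined into one that needs a single pass.
both :: Fold String ([String], [String])
both = (,) <$> collect atomOf <*> collect sheetOf
```

With these definitions, runFold both applied to the list of lines yields both result lists after walking the input once; adding a third kind of record means adding one more collect to the applicative expression.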

import qualified Data.ByteString.Char8 as B
import Data.Monoid

eachLine :: B.ByteString -> ([Atom], [Sheet])
eachLine bs | isAnAtom bs = ([ {- calculate an Atom -} ], [])
            | isASheet bs = ([], [ {- calculate a Sheet -} ])
            | otherwise = error "eachLine"

allLines :: [B.ByteString] -> ([Atom], [Sheet])
allLines bss = mconcat (map eachLine bss)

The magic is done by mconcat from Data.Monoid (included with GHC).
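The reason mconcat works here is that a pair of monoids is itself a monoid, and lists are monoids under concatenation, so pairs of lists are combined componentwise. A tiny sketch, with String aliases as hypothetical stand-ins for the real Atom and Sheet types:

```haskell
-- Hypothetical stand-ins for the real record types.
type Atom  = String
type Sheet = String

-- For pairs of lists, mempty is ([], []) and (<>) appends componentwise,
-- so mconcat gathers all first components and all second components.
combined :: ([Atom], [Sheet])
combined = mconcat [(["a1"], []), ([], ["s1"]), (["a2"], [])]
```

Here combined evaluates to (["a1","a2"],["s1"]).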

(On a point of style: personally I would define a Line type and a parseLine :: B.ByteString -> Line function, and write eachLine bs = case parseLine bs of ... . But this is peripheral to your question.)
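That stylistic variant might look roughly like the sketch below. The Atom and Sheet newtypes, the Line constructors, and the rule that a record is recognised by its leading keyword are all assumptions made for illustration; the real types and parsing logic come from your program.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical stand-ins for the real record types.
newtype Atom  = Atom B.ByteString  deriving (Eq, Show)
newtype Sheet = Sheet B.ByteString deriving (Eq, Show)

-- One constructor per kind of line the file can contain.
data Line = AtomLine Atom | SheetLine Sheet | OtherLine
  deriving (Eq, Show)

-- Assumption: a record is identified by its leading keyword.
parseLine :: B.ByteString -> Line
parseLine bs
  | B.pack "ATOM"  `B.isPrefixOf` bs = AtomLine (Atom bs)
  | B.pack "SHEET" `B.isPrefixOf` bs = SheetLine (Sheet bs)
  | otherwise                        = OtherLine

-- Classification is now an ordinary case expression over Line.
eachLine :: B.ByteString -> ([Atom], [Sheet])
eachLine bs = case parseLine bs of
  AtomLine a  -> ([a], [])
  SheetLine s -> ([], [s])
  OtherLine   -> ([], [])

allLines :: [B.ByteString] -> ([Atom], [Sheet])
allLines = foldMap eachLine
```

Unlike the version above, this sketch silently skips unrecognised lines instead of calling error; which behaviour is right depends on the file format.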

It is a good idea to introduce a new ADT, e.g. "Summary", instead of tuples. Then, since you want to accumulate the values of Summary, you can make it an instance of Monoid. Then you classify each of your lines with the help of classifier functions (e.g. isAtom, isSheet, etc.) and concatenate them together using Monoid's mconcat function (as suggested by @dave4420).

Here is the code (it uses String instead of ByteString, but it is quite easy to change):

module Classifier where

import Data.List
import Data.Monoid

data Summary = Summary
  { atoms :: [String]
  , sheets :: [String]
  , digits :: [String]
  } deriving (Show)

-- GHC 8.4 and later require a Semigroup instance alongside Monoid.
instance Semigroup Summary where
  Summary as1 ss1 ds1 <> Summary as2 ss2 ds2 =
    Summary (as1 <> as2)
            (ss1 <> ss2)
            (ds1 <> ds2)

instance Monoid Summary where
  mempty = Summary [] [] []

classify :: [String] -> Summary
classify = mconcat  . map classifyLine

classifyLine :: String -> Summary
classifyLine line
  | isAtom line  = Summary [line] [] [] -- or "mempty { atoms = [line] }"
  | isSheet line = Summary [] [line] []
  | isDigit line = Summary [] [] [line]
  | otherwise    = mempty -- or "error" if you need this  

isAtom, isSheet, isDigit :: String -> Bool
isAtom = isPrefixOf "atom"
isSheet = isPrefixOf "sheet"
isDigit = isPrefixOf "digits"

input :: [String]
input = ["atom1", "sheet1", "sheet2", "digits1"]

test :: Summary
test = classify input

If you have only 2 alternatives, using Either might be a good idea. In that case combine your functions, map the list, and use lefts and rights to get the results:

import Data.Either

-- first sample function, returning String
f1 x = show $ x `div` 2

-- second sample function, returning Int
f2 x = 3*x+1

-- combined function returning Either String Int
hotpo x = if even x then Left (f1 x) else Right (f2 x)

xs = map hotpo [1..10] 
-- [Right 4,Left "1",Right 10,Left "2",Right 16,Left "3",Right 22,Left "4",Right 28,Left "5"]

lefts xs 
-- ["1","2","3","4","5"]

rights xs
-- [4,10,16,22,28]
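Note that lefts and rights each walk the mapped list separately. If a single pass over the results matters, Data.Either also provides partitionEithers, which splits the list in one traversal. A small sketch reusing the same hotpo function (the name results is just for illustration):

```haskell
import Data.Either (partitionEithers)

-- Same combined function as above, with an explicit type signature.
hotpo :: Int -> Either String Int
hotpo x = if even x then Left (show (x `div` 2)) else Right (3 * x + 1)

-- Splits all Left values and all Right values in a single traversal.
results :: ([String], [Int])
results = partitionEithers (map hotpo [1 .. 10])
-- (["1","2","3","4","5"],[4,10,16,22,28])
```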
