简体   繁体   English

函数式语言中的 Kernighan & Ritchie 字数统计示例程序

[英]Kernighan & Ritchie word count example program in a functional language

I have been reading a little bit about functional programming on the web lately and I think I got a basic idea about the concepts behind it.我最近在 web 上阅读了一些关于函数式编程的内容,我想我对它背后的概念有了一个基本的了解。

I'm curious how everyday programming problems which involve some kind of state are solved in a pure functional programing language.我很好奇涉及某种 state 的日常编程问题是如何用纯函数式编程语言解决的。

For example: how would the word count program from the book 'The C programming Language' be implemented in a pure functional language?例如:“C 编程语言”一书中的字数统计程序如何用纯函数式语言实现?

Any contributions are welcome as long as the solution is in a pure functional style.只要解决方案是纯函数式的,就欢迎任何贡献。

Here's the word count C code from the book:这是书中的字数 C 代码:

#include <stdio.h>

#define IN  1 /* inside a word */
#define OUT 0 /* outside a word */

/* count lines, words, and characters in input */
main()
{
  int c, nl, nw, nc, state;

  state = OUT;
  nl = nw = nc = 0;
  while ((c = getchar()) != EOF) {
    ++nc;
    if (c == '\n')
      ++nl;
    if (c == ' ' || c == '\n' || c = '\t')
      state = OUT;
    else if (state == OUT) {
      state = IN;
      ++nw;
    }
  }

  printf("%d %d %d\n", nl, nw, nc);
}

Basically, in a functional styly you'll want to divide the IO operation of getting your stream of data from the pure operation of some stateful transistion based on the current character and the current state.基本上,在功能风格中,您需要根据当前字符和当前 state 将获取 stream 数据的 IO 操作与某些有状态转换的纯操作分开。

The Haskell solution from Tikhon is very clean but performs three passes on your input data and will result in the entire input being contained in memory until the result is computed.来自 Tikhon 的 Haskell 解决方案非常干净,但对您的输入数据执行三遍,并且将导致整个输入包含在 memory 中,直到计算出结果。 You can process data incrementally, which I do below using the Text package but no other advanced Haskell tools (which could clean this up at the expense of understandability by non-Haskellers).您可以增量处理数据,我在下面使用 Text package 而不是其他高级 Haskell 工具(这可能会以非 Haskellers 的可理解性为代价清理它)。

First we have our preamble:首先我们有序言:

{-# LANGUAGE BangPatterns #-}

import Data.Text.Lazy as T
import Data.Text.Lazy.IO as TIO

Then we define our data structure to hold the state of the process (number of characters, words, and lines along with the State IN/OUT):然后我们定义我们的数据结构来保存进程的 state(字符数、单词数和行数以及 State IN/OUT):

data Counts = Cnt { nc, nl, nw :: !Int
                  , state :: State  }
        deriving (Eq, Ord, Show)

data State = IN | OUT
        deriving (Eq, Ord, Show)

Now I define a "zero" state just for easy use.现在我定义一个“零” state 只是为了方便使用。 I'd normally make a number of helper functions or use a package like lense to make incrementing each field in the Counts structure simple, but will go without for this answer:我通常会创建一些辅助函数或使用 package 之类的 lense 来使Counts结构中每个字段的递增变得简单,但如果没有这个答案, go 会:

zeros :: Counts
zeros = Cnt 0 0 0 OUT

And now I translate your series of if/else statements into a pure state machine:现在我将你的一系列 if/else 语句翻译成一台纯 state 机器:

op :: Counts -> Char -> Counts
op c '\n' = c { nc = nc c + 1, nw = nw c + 1, nl = nl c + 1, state=OUT }
op c ch | ch == ' ' || ch == '\t' = c { nc = nc c + 1, state=OUT }
        | state c == OUT = c { nc = nc c + 1, nw = nw c + 1, state = IN }
        | otherwise  = c { nc = nc c + 1 }

Finally the main function just gets the input stream and folds our operation over the characters:最后main function 只得到输入 stream 并将我们的操作折叠到字符上:

main = do
        contents <- TIO.getContents
        print $ T.foldl' op zeros contents

EDIT: You mentioned not understanding the syntax.编辑:你提到不理解语法。 Here is an even simpler version that I will explain:这是我将解释的更简单的版本:

import Data.Text.Lazy as T
import Data.Text.Lazy.IO as TIO

op (nc, nw, nl, st) ch
  | ch == '\n'              = (nc + 1, nw + 1 , nl + 1, True)
  | ch == ' ' || ch == '\t' = (nc + 1, nw     , nl    , True)
  | st                      = (nc + 1, nw + 1 , nl    , False)
  | otherwise               = (nc + 1, nw     , nl    , st)

main = do
        contents <- TIO.getContents
        print $ T.foldl' op (0,0,0,True) contents
  • The import statements give us access to the getContents and foldl' functions we use. import语句使我们能够访问我们使用的getContentsfoldl'函数。

  • The op function uses a bunch of guards - parts like | ch = '\n' op function 使用了一堆守卫 - 部分像| ch = '\n' | ch = '\n' - which is basically like a C if/elseif/else series. | ch = '\n' - 基本上类似于 C if/elseif/else 系列。

  • The tuples (... , ... , ... , ... ) contain all our state. Haskell variables are immutable, so we make new tuples by adding one (or not) to the values of the previous variables.元组(... , ... , ... , ... )包含我们所有的 state。Haskell 变量是不可变的,因此我们通过将一个(或不)添加到先前变量的值来创建新的元组。

A simple way to do it would be to read in the input and then use some simple functions to get the line/word/character count.一种简单的方法是读取输入,然后使用一些简单的函数来获取行/字/字符数。 Something like this would work:这样的事情会起作用:

count :: String -> (Int, Int, Int)
count str = (length $ lines str, length $ words str, length str)

main :: IO ()
main = fmap count getContents >>= print

This isn't exactly the same, but it's close.这不完全相同,但很接近。

This works really simply.这真的很简单。 Given a string, we can turn it into a list of lines with the standard lines function and a list of words with the standard words function. Since String is just [Char] , length returns the number of characters.给定一个字符串,我们可以将其转换为包含标准lines function 的行列表和包含标准words function 的单词列表。由于String只是[Char]length返回字符数。 This is how we get the three counts.这就是我们如何获得三个计数。 (For reference, length $ lines str is the same as length (lines str) .) (作为参考, length $ lines strlength (lines str)相同。)

The important idea is how the IO --reading the input and printing it out--is separated from the actual logic.重要的想法是IO读取输入并将其打印出来——如何与实际逻辑分离。

Also, instead of going through the input character by character keeping track of some state, we get the actual numbers by applying simple functions to the input.此外,我们不是逐个字符地跟踪输入 state,而是通过对输入应用简单函数来获取实际数字。 These functions are all just compositions of standard library functions.这些函数都是标准库函数的组合。

In your loop there are four state variables, nc, nw, nl and state, plus the next character c. The loop remembers nc, nw, nl and state from the last time through the loop, and c changes each iteration through the loop.在您的循环中有四个 state 变量,nc、nw、nl 和 state,加上下一个字符 c。循环会记住上次循环的 nc、nw、nl 和 state,并且 c 通过循环更改每次迭代。 Imagine instead that you take those variables out of the loop and put them in a vector: [state, nc, nw, nl].想象一下,您将这些变量从循环中取出并将它们放入一个向量中:[state, nc, nw, nl]。 Then you change your loop construct into a function that takes two arguments, the first being a vector [state, nc, nw, nl], and the second being c, and returns a new vector with the updated values of nc, nw, nl and state. In C-ish pseudocode:然后,将循环构造更改为 function,它采用两个 arguments,第一个是向量 [state, nc, nw, nl],第二个是 c,并返回一个新向量,其更新值为 nc, nw, nl和 state。在 C-ish 伪代码中:

f([state, nc, nw, nl], c) {
    ++nc;
    if (c == '\n')
      ++nl;
    if (c == ' ' || c == '\n' || c = '\t')
      state = OUT;
    else if (state == OUT) {
      state = IN;
      ++nw;
    }
    return [state, nc, nw, nl];
}

Now you can call that function with the vector [OUT, 0, 0, 0] and the first character in the string ("hello, world", say), and it will return a new vector [IN, 1, 0, 0].现在您可以使用向量 [OUT, 0, 0, 0] 和字符串中的第一个字符(例如“hello, world”)调用 function,它将返回一个新向量 [IN, 1, 0, 0 ]. Call f again with this new vector and the second character 'e', and it returns [IN, 2, 0, 0].使用这个新向量和第二个字符“e”再次调用 f,它返回 [IN, 2, 0, 0]。 Repeat for the rest of the characters in the string, and the last call will return [IN, 12, 2, 0], identical to the values printed by the C code.对字符串中的字符 rest 重复,最后一次调用将返回 [IN, 12, 2, 0],与 C 代码打印的值相同。 The basic idea is that you take the state variables out of the loop, turn the guts of the loop into a function, and pass the vector of state variables and the next input in as arguments to that function, and return a new state vector as a result.基本思想是,将 state 个变量从循环中取出,将循环的核心变成 function,然后将 state 个变量的向量和下一个输入作为 arguments 传递给该 function 向量,并返回一个新的 882071483874 作为结果。 There is a function called reduce that does this.有一个名为 reduce 的 function 可以执行此操作。

Here's how you would do it in Clojure (formatted to emphasize the vectors returned):以下是您在 Clojure 中的操作方式(格式化以强调返回的向量):

(defn f [[state nc nw nl] c]
  (let [nl (if (= c \n)(inc nl) nl)]
    (cond
     (or (= c \space)(= c \n)(= c \t)) [:out  (inc nc) nw       nl]
     (= state :out)                    [:in   (inc nc) (inc nw) nl]
     true                              [state (inc nc) nw       nl]
)))

(defn wc [s] (reduce f [:out 0 0 0] s))

(wc "hello, world")

which returns (and prints in the repl) [:in 12 2 0]它返回(并在 repl 中打印)[:in 12 2 0]

Here's my shot at a purely functional, strict, single-pass, tail-recursive solution in Scheme:这是我在 Scheme 中的纯函数式、严格的、单次通过、尾递归解决方案的镜头:

(define (word-count input-port)
  (let loop ((c (read-char input-port))
             (nl 0)
             (nw 0)
             (nc 0)
             (state 'out))
    (cond ((eof-object? c)
           (printf "nl: ~s, nw: ~s, nc: ~s\n" nl nw nc))
          ((char=? c #\newline)
           (loop (read-char input-port) (add1 nl) nw (add1 nc) 'out))
          ((char-whitespace? c)
           (loop (read-char input-port) nl nw (add1 nc) 'out))
          ((eq? state 'out)
           (loop (read-char input-port) nl (add1 nw) (add1 nc) 'in))
          (else
           (loop (read-char input-port) nl nw (add1 nc) state)))))

word-count receives an input port as a parameter; word-count接收一个input port作为参数; notice that no additional data structures are created (structs, tuples, vectors, etc.) instead, all state is kept in parameters.请注意,没有创建额外的数据结构(结构、元组、向量等),所有 state 都保存在参数中。 As an example, for counting the words in a file containing this:例如,为了计算包含以下内容的文件中的单词:

hello, world

Call the procedure like this:像这样调用过程:

(call-with-input-file "/path/to/file" word-count)
> nl: 0, nw: 2, nc: 12

Common Lisp is mentioned, but it is not a pure functional programming language and it does not support TCO in its standard.提到了 Common Lisp,但它不是一种纯函数式编程语言,它的标准也不支持 TCO。 Individual implementations do.个别实施。

Tail recursive version, if the compiler supports it:尾递归版本,如果编译器支持的话:

(defun word-count (&optional (stream *standard-input*))
  (labels ((word-count-aux (in-p chars words lines)
             (case (read-char stream nil :eof)
               (:eof (values chars words lines))
               (#\newline (word-count-aux nil (1+ chars) words (1+ lines)))
               ((#\space #\tab)   (word-count-aux nil (1+ chars) words lines))
               (otherwise (word-count-aux t
                                          (1+ chars)
                                          (if in-p words (1+ words))
                                          lines)))))
    (word-count-aux nil 0 0 0)))

But since TCO is not in the standard, a portable version would look more like this:但由于 TCO 不在标准中,便携版本看起来更像这样:

(defun word-count (&optional (stream *standard-input*)
                   &aux (in-p nil) (chars 0) (words 0) (lines 0) char)
  (loop while (setf char (read-char stream nil nil)) do
        (case char
          (#\newline         (setf in-p nil) (incf lines))
          ((#\space #\tab)   (setf in-p nil))
          (otherwise (unless in-p (incf words)) (setf in-p t)))
        (incf chars))
  (values chars words lines))

Above is no longer Functional .以上不再是Functional

We can replace the loop with a higher-order stream-map :我们可以用高阶stream-map替换循环:

(defun stream-map (function stream)
  (loop for char = (read-char stream nil nil)
        while char do (funcall function char)))

(defun word-count (&optional (stream *standard-input*)
                   &aux (in-p nil) (chars 0) (words 0) (lines 0) char)
  (stream-map (lambda (char)
                (incf chars)
                (when (eql char #\newline)
                  (incf lines))
                (if (member char '(#\space #\newline #\tab))
                    (setf in-p nil)
                  (unless in-p
                    (incf words)
                    (setf in-p t))))
              stream)
  (values chars words lines))

The state is modified by the closure. state 由闭包修改。 To get rid of that we can implement a stream-reduce .为了摆脱它,我们可以实施stream-reduce

(defun stream-reduce (function stream &key initial-value)
  (let ((value initial-value))
    (loop for char = (read-char stream nil nil)
          while char
          do (setf value (funcall function value char)))
  value))

(defun word-count (&optional (stream *standard-input*))
  (rest (stream-reduce
          (lambda (state char)
            (destructuring-bind (in-p chars words lines) state
               (case char
                  (#\newline         (list nil (1+ chars) words (1+ lines)))
                  ((#\space #\tab)   (list nil (1+ chars) words lines))
                  (otherwise         (list t
                                           (1+ chars)
                                           (if in-p words (1+ words))
                                           lines)))))
          stream
          :initial-value (list nil 0 0 0))))

Here is a Scheme version of the program, from my blog , which implements the entire Unix word count program, including argument- and file-handling.这是程序的 Scheme 版本,来自我的博客,它实现了整个 Unix 字数统计程序,包括参数和文件处理。 The key function is wc, which is purely functional. key function 是wc,纯函数。 It moves all local variables into the arguments of a local function (defined via named-let), which is the standard idiom for converting an imperative loop to functional style.它将所有局部变量移动到局部 function 的 arguments(通过 named-let 定义),这是将命令式循环转换为函数式风格的标准习惯用法。 The man page and code appear below:手册页和代码如下所示:

NAME

    wc -- word count

SYNOPSIS

    wc [ -lwc ] [ name ... ]

DESCRIPTION

    Wc counts lines, words and characters in the named files,
    or in the standard input if no name appears. A word is a
    maximal string of characters delimited by spaces, tabs or
    newlines.

    If the optional argument is present, just the specified
    counts (lines, words, or characters) are selected by the
    letters l, w or c.

#! /usr/bin/scheme --script

(define l-flag #t)
(define w-flag #t)
(define c-flag #t)

(define (update-flags fs)
  (if (not (member #\l fs)) (set! l-flag #f))
  (if (not (member #\w fs)) (set! w-flag #f))
  (if (not (member #\c fs)) (set! c-flag #f)))

(define (put-dec n width)
  (let* ((n-str (number->string n)))
    (display (make-string (- width (string-length n-str)) #\space))
    (display n-str)))

(define (wc)
  (let loop ((inword #f) (c (read-char)) (ls 0) (ws 0) (cs 0))
    (cond ((eof-object? c) (values ls ws cs))
          ((char=? c #\newline)
            (loop #f (read-char) (add1 ls) ws (add1 cs)))
          ((not (member c '(#\space #\newline #\tab)))
            (if inword
                (loop #t (read-char) ls ws (add1 cs))
                (loop #t (read-char) ls (add1 ws) (add1 cs))))
          (else (loop #f (read-char) ls ws (add1 cs))))))

(define (main args)
  (when (and (pair? args) (char=? (string-ref (car args) 0) #\-))
        (update-flags (cdr (string->list (car args))))
        (set! args (cdr args)))
  (if (null? args)
      (let-values (((ls ws cs) (wc)))
        (when l-flag (display ls) (display " "))
        (when w-flag (display ws) (display " "))
        (when c-flag (display cs) (display " "))
        (newline))
      (let loop ((args args) (l-tot 0) (w-tot 0) (c-tot 0))
        (if (null? args)
            (begin (when l-flag (put-dec l-tot 12))
                   (when w-flag (put-dec w-tot 12))
                   (when c-flag (put-dec c-tot 12)))
            (with-input-from-file (car args)
              (lambda ()
                (let-values (((ls ws cs) (wc)))
                  (when l-flag (put-dec ls 12))
                  (when w-flag (put-dec ws 12))
                  (when c-flag (put-dec cs 12))
                  (display " ") (display (car args)) (newline)
                  (loop (cdr args) (+ l-tot ls) (+ w-tot ws) (+ c-tot cs)))))))))     

(main (cdr (command-line)))

Here's a solution based on the Clojure example posted here but in CL using recursion.这是一个基于此处发布的 Clojure 示例的解决方案,但在 CL 中使用递归。

(defstruct (state (:constructor make-state (state chars words lines)))
  state chars words lines)


(defun wc (state stream)
  (symbol-macrolet ((s (state-state state))
                    (c (state-chars state))
                    (w (state-words state))
                    (l (state-lines state)))

    (case (read-char stream nil :eof)
      (:eof state)

      (#\Newline (wc (make-state :out (1+ c) w (1+ l)) stream))
      (#\Space   (wc (make-state :out (1+ c) w     l)  stream))

      (t (if (eq s :out)
             (wc (make-state :in (1+ c) (1+ w) l) stream)
             (wc (make-state :in (1+ c)     w  l) stream))))))


(with-input-from-string (stream "Hello Functional Programming World")
  (wc (make-state :out 0 0 0) stream))

;;; ;;; => #S(STATE:STATE:IN:CHARS 34:WORDS 4:LINES 0) => #S(STATE:STATE:IN:CHARS 34:WORDS 4:LINES 0)

I believe you could write this somewhat more elegantly while still only iterating over the input once, but you'll need to make GHC do more work, certainly use -O2 .我相信你可以写得更优雅一些,同时仍然只迭代输入一次,但你需要让 GHC 做更多的工作,当然使用-O2

I have not yet compiled this code, much less compared it's speed vs. Thomas DuBuisson's answer, but this should indicate the basic direction.我还没有编译这段代码,更不用说它的速度与 Thomas DuBuisson 的答案相比了,但这应该表明了基本方向。

{-# LANGUAGE BangPatterns #-}
import Data.List

wordcount = snd . foldl' go (False,0) 
  where  go (!b,!n) !c =  if  elem c [' ','\t','\n']  then  (False,n)
              else  (True, n + if b then 0 else 1)

linecount = foldl' go 0
  where  go !n !c = n + if c == '\n' then 1 else 0

main = interact $ show . go
  where  go x = (linecount x, wordcount x, foldl' (\!n _ ->n+1) 0 x)

If I understand stream fusion correctly, then GHC should inline wordcount and linecount into main , merge the three foldl' commands into one, well hopefully, and start rearranging the if checks.如果我正确理解 stream 融合,那么 GHC 应该将wordcountlinecount内联到main中,将三个foldl'命令合并为一个,希望如此,然后开始重新安排 if 检查。 I'd hope it'd inlined elem and foldl' too of course.当然,我希望它也内联elemfoldl'

If not, you could certainly force inlining and probably create a simple fusion rule, but maybe the defaults suffice.如果没有,您当然可以强制内联并可能创建一个简单的融合规则,但也许默认值就足够了。 Or maybe some simple alterations produce the desired effect.或者,也许一些简单的改动会产生预期的效果。

Btw, I have written foldl' (\n _ ->n+1) 0 x only because I've heard bad storied about length , but maybe length works fine, another change worth profiling.顺便说一句,我写foldl' (\n _ ->n+1) 0 x只是因为我听说过关于length的坏故事,但也许length工作正常,另一个值得分析的变化。

In Haskell using strict IO rather than lazy.在 Haskell 中使用严格的 IO 而不是懒惰的。 Does words only but you can easily implement characters and lines on top of this.只做文字,但您可以在此基础上轻松实现字符和线条。 Requires the text and conduit packages:需要textconduit包:

module Main
where

import Control.Applicative
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Text as CT
import qualified Data.Text as T
import System.Environment

main :: IO ()
main = do args <- getArgs
          print <$> (runResourceT $
            CB.sourceFile (args !! 0)
                $$  CB.lines
                =$= CT.decode CT.utf8
                =$= CL.map T.words
                =$  CL.fold (\acc words -> acc + length words) 0)

Here's a version in Typed Racket using match and the for loop macros:这是使用matchfor循环宏的 Typed Racket 中的一个版本:

(: word-count : Input-Port -> Void)
(define (word-count in)
  (define-values (nl nw nc st)
    (for/fold: ([nl : Integer 0] [nw : Integer 0] [nc : Integer 0] 
                [state : (U 'in 'out) 'out])
      ([c (in-input-port-chars in)])
      (match* (c state)
        [(#\newline _) (values (add1 nl) nw (add1 nc) 'out)]
        [((? char-whitespace?) _)
         (values (add1 nl) nw (add1 nc) 'out)]
        [(_ 'out) (values nl (add1 nw) (add1 nc) 'in)]
        [(_ _) (values nl nw (add1 nc) state)])))
  (printf "nl: ~s, nw: ~s, nc: ~s\n" nl nw nc))

Here is a Haskell implementation, where I have tried to stay close to the approach followed by the original C program.这是一个 Haskell 实现,我在其中尝试与原始 C 程序遵循的方法保持接近。 Iterations often become fold operations, with state-containing variables ending up as the first argument to the operation passed to fold .迭代通常成为折叠操作,包含状态的变量最终作为传递给fold的操作的第一个参数。

-- Count characters, words, and lines in an input string.
wordCount::String->(Int, Int, Int)
wordCount str = (c,w,l)
  where (inWord,c,w,l) = foldl op (False,0,0,1) str
          where op (inWord,c,w,l) next | next == '\n' = (False,c+1,w,l+1)
                                       | next == '\t' || next == ' ' = (False,c+1,w,l)
                                       | inWord == False = (True,c+1,w+1,l)
                                       | otherwise = (True,c+1,w,l)

main = interact $ show . wordCount

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM