Haskell pattern matching concept behind text splitter

I want to know the pattern matching concept behind this code snippet:

 split :: String -> Char -> [String]
 split [] delim = [""]
 split (c:cs) delim
     | c == delim = "" : rest
     | otherwise = (c : head rest) : tail rest
       where
         rest = split cs delim

I know that head returns the first element of a list and tail returns the rest, but I still cannot understand how this function works. It takes a string and breaks it into a list of strings at a given delimiter character.

Maybe it's clearer in the following form:

split [] delim = [""]    -- a list containing only an empty String
split (c:cs) delim = let (firstWord:moreWords) = split cs delim
                     in if c == delim
                           then "" : firstWord : moreWords
                           else (c:firstWord) : moreWords

The function traverses the input string, comparing each character with the delimiter. If the current character is not the delimiter, it is tacked onto the front of the first word (which may be empty) that results from splitting the remainder of the string. If it is the delimiter, an empty string is added to the front of the result of splitting the remainder, which starts a new word.
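
A few concrete results may make that rule easier to see. The block below is only a sketch: it repeats the definition from the question so that it loads on its own, and the name examples and the sample inputs are made up for illustration.

```haskell
-- The definition from the question, repeated so this block is self-contained.
split :: String -> Char -> [String]
split [] _ = [""]
split (c:cs) delim
    | c == delim = "" : rest
    | otherwise  = (c : head rest) : tail rest
  where
    rest = split cs delim

-- Hypothetical sample inputs paired with the result of splitting them.
examples :: [(String, [String])]
examples =
    [ ("abc cde", split "abc cde" ' ')  -- ["abc","cde"]
    , ("a,,b",    split "a,,b" ',')     -- ["a","","b"]: adjacent delimiters give an empty word
    , ("ab,",     split "ab," ',')      -- ["ab",""]: a trailing delimiter gives a trailing empty word
    , ("",        split "" ',')         -- [""]: the base case
    ]
```

Loading this in GHCi and evaluating examples shows that the result is never empty, so head rest and tail rest in the second guard are always safe.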

For example, the evaluation of split "abc cde" ' ' proceeds like

split "abc cde" ' '
    ~> 'a' == ' ' ? No, next guard
    ~> ('a' : something) : somethingElse

where something and somethingElse will be determined later by splitting the remainder "bc cde". After looking at the first character, it's been determined that whatever the final result is, its first entry starts with 'a'. Going on to determine the rest,

split "bc cde" ' '
    ~> ('b' : something1) : somethingElse1
       where (something1 : somethingElse1) = split "c cde" ' '

So now the first two characters of the first entry of the result are known. The next step determines that something1 starts with 'c'. Finally we reach the delimiter. In that case the first element of the result is determined without reference to later recursive calls (it is the empty string, which ends the word built so far), and only the remainder of the result remains to be found in the recursion.
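
The point that the first entry is settled as soon as the delimiter is reached can be checked directly, because Haskell evaluates lazily. In the sketch below, firstEntry is a hypothetical name, and undefined stands in for "input that has not been looked at yet".

```haskell
-- The definition from the question, repeated so this block is self-contained.
split :: String -> Char -> [String]
split [] _ = [""]
split (c:cs) delim
    | c == delim = "" : rest
    | otherwise  = (c : head rest) : tail rest
  where
    rest = split cs delim

-- Only 'a', 'b' and the space are inspected to produce the first entry.
-- The part of the input after the space (including undefined) is never forced.
firstEntry :: String
firstEntry = head (split ("ab cd" ++ undefined) ' ')  -- "ab"
```

Asking for the second entry instead would run into undefined, since that entry depends on input beyond "cd".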

Another way of formulating the algorithm is (thanks @dave4420 for the suggestion)

split input delim = foldr combine [""] input
  where
    combine c rest@(~(wd : wds))
        | c == delim = "" : rest
        | otherwise  = (c : wd) : wds
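
The lazy pattern ~(wd : wds) is what keeps this version as lazy as the explicit recursion: combine can test c against the delimiter without first forcing the result for the rest of the input. As a rough sketch of the effect (the names splitFold and firstTwo are made up here so the example stands on its own), the fold even works on an infinite string:

```haskell
-- The foldr formulation from above, renamed so this block is self-contained.
splitFold :: String -> Char -> [String]
splitFold input delim = foldr combine [""] input
  where
    -- The ~ makes the match on (wd : wds) lazy, so rest is only
    -- evaluated when wd or wds is actually needed.
    combine c rest@(~(wd : wds))
        | c == delim = "" : rest
        | otherwise  = (c : wd) : wds

-- cycle "ab " is the infinite string "ab ab ab ...".
firstTwo :: [String]
firstTwo = take 2 (splitFold (cycle "ab ") ' ')  -- ["ab","ab"]
```

Without the ~, matching the second argument of combine would force the recursive fold before any output is produced, so the same expression would not terminate on infinite input.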
