Converting an imperative algorithm into functional style

Question

I wrote a simple procedure to calculate the average test coverage of some specific packages in a Java project. The raw data, in a huge HTML file, looks like this:

<body>  
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>  
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>  
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>  
...   
</body>

Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:

(11+33)/(111+333)

and average branch coverage is:

(44+66)/(444+666)

I wrote the following procedure to get the result and it works well. But how can I implement this calculation in a functional style? Something like "(x,y) for x in ... for b in ... if...". I know a little Erlang, Haskell and Clojure, so solutions in these languages are also appreciated. Thanks a lot!

from __future__ import division
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
    for pkg in core_pkgs:
        ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
        match = ptn.match(line)
        if match is not None:
            cvln, tlln, cvbh, tlbh = match.groups()
            covered_lines += int(cvln)
            total_lines += int(tlln)
            covered_branches += int(cvbh)
            total_branches += int(tlbh)
print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches/total_branches)

Answer 1

Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.

First, you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is partly because it is easier to design your code when you can think in terms of the things you are designing, which is closely related to functional programming philosophy, and partly because it can eliminate bugs where you think you are doing one thing but are actually doing another.
Related to the point before: The first thing I do is to convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions, one for each of those tasks, and you do them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart into three different ones.
The actual workings of the conversion function are quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string against a regex, and if the match succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverage function works with a lot of specialised loops, so I don't have to write a ton of functions; I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself; I just pass a function to those standard library loops and let them take it from there.

import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)

corePkgs = ["d", "f"]

stats = [
  "d>11/23d>34/89d",
  "e>25/65e>13/25e",
  "f>36/92f>19/76"
  ]

format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"


-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int


-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)


-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
  Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)


main = do
      -- First we need to convert the strings to coverage data
  let coverageData = convert stats
      -- Then we want to filter out only the relevant data
      relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
      -- Then we need to summarise it, but we are only interested in the numbers
      Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData

  -- So we can finally print them!
  printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
  printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
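
For readers who don't read Haskell, here is a rough Python 3 sketch of the same three passes (convert, filter, fold). The names Coverage, convert, sum_coverage and summarise are made up for this illustration, and the regex is a simplified stand-in for the Haskell one (it assumes the package name sits at the start of the line), so treat it as an approximation rather than a faithful translation.

```python
import re
from collections import namedtuple
from functools import reduce

# Hypothetical Python counterpart of the Haskell Coverage type.
Coverage = namedtuple("Coverage", "name cl tl cb tb")

# Simplified pattern (an assumption): package name first, then two "n/m" pairs.
FORMAT = re.compile(r"(\w+)>(\d+)/(\d+).*>(\d+)/(\d+)")


def convert(lines):
    """Pass 1: parse each line, keeping only the lines that match."""
    matches = (FORMAT.match(line) for line in lines)
    return [Coverage(m.group(1), *map(int, m.groups()[1:])) for m in matches if m]


def sum_coverage(a, b):
    """Combine two coverage records by adding their counters."""
    return Coverage(a.name + b.name + ",",
                    a.cl + b.cl, a.tl + b.tl, a.cb + b.cb, a.tb + b.tb)


def summarise(lines, core_pkgs):
    coverage_data = convert(lines)                                   # pass 1: convert
    relevant = filter(lambda c: c.name in core_pkgs, coverage_data)  # pass 2: filter
    return reduce(sum_coverage, relevant, Coverage("", 0, 0, 0, 0))  # pass 3: fold


stats = ["d>11/23d>34/89d", "e>25/65e>13/25e", "f>36/92f>19/76"]
total = summarise(stats, ["d", "f"])
print("Line coverage: {:.2%}".format(total.cl / total.tl))
print("Branch coverage: {:.2%}".format(total.cb / total.tb))
```

Here functools.reduce plays the role of foldl', and sum_coverage only ever sees two records at a time, just like sumCoverage.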

Answer 2

Here are some quickly-hacked, untested ideas applied to your code:

import numpy as np
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
totals = np.zeros(4, dtype=int)

for pkg in core_pkgs:
    ptn = re.compile('.*' + pkg + r'.*>(\d+)/(\d+).*>(\d+)/(\d+).*')
    # map() takes the function first, then the iterable
    matches = map(ptn.match, datafile)
    # statsList is a list of [cvln, tlln, cvbh, tlbh]; lines that did not match are skipped
    statsList = [list(map(int, match.groups())) for match in matches if match]
    if statsList:
        # sum down the columns (axis=0) and accumulate across packages
        totals += np.array(statsList).sum(axis=0)

covered_lines, total_lines, covered_branches, total_branches = totals

Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
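
One possible way to finish off that remaining loop, sketched in plain Python 3 without NumPy: a nested comprehension yields one row per matching (package, line) pair, and zip(*rows) transposes the rows so each column can be summed. The helper pattern_for and the use of re.escape are additions made for this sketch, and the final unpacking assumes at least one line matches.

```python
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')


def pattern_for(pkg):
    # Hypothetical helper: same pattern as above; re.escape is an added
    # precaution for package names that contain regex metacharacters.
    return re.compile('.*' + re.escape(pkg) + r'.*>(\d+)/(\d+).*>(\d+)/(\d+).*')


# One row [cvln, tlln, cvbh, tlbh] per (package, line) pair that matches.
rows = [
    [int(n) for n in match.groups()]
    for pkg in core_pkgs
    for match in map(pattern_for(pkg).match, datafile)
    if match
]

# zip(*rows) turns the rows into four columns; each column is summed on its own.
covered_lines, total_lines, covered_branches, total_branches = map(sum, zip(*rows))

print('Line coverage: {:.2%}'.format(covered_lines / total_lines))
print('Branch coverage: {:.2%}'.format(covered_branches / total_branches))
```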

Answer 3

This is the corresponding Clojure solution:

(defn extract-data
  "extract four integers from a string line according to a package name"
  [pkg line]
  (map read-string
       (rest (first
              (re-seq
               (re-pattern
                (str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
               line)))))

(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
    according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))

(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))

(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))

(defn get-cov
  [pkgs lines]
  {:line-cov (apply get-percent (take 2 (sum-data pkgs lines)))
    :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})

(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])

3 answers

solution1
3 2013-09-29 11:42:06

solution2
1 2013-09-29 10:16:39

solution3
0 ACCEPTED 2013-10-14 13:03:48
