This question is quite relevant, but is 2 years old: In memory OLAP engine in Java
I would like to create a pivot-table-like matrix from a given tabular dataset, in memory,
e.g. an age by marital status count (rows are age, columns are marital status).
The input: a List of People, with age and some Boolean property (e.g. married).
The desired output: a count of People, by age (row) and isMarried (column).
case class Person(val age:Int, val isMarried:Boolean)
...
val people:List[Person] = ... //
val peopleByAge = people.groupBy(_.age) //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status
I managed to do it the naive way: first grouping by age, then a map that counts by marital status and outputs the result, then a foldRight to aggregate:
TreeMap(peopleByAge.toSeq: _*).map { x =>
  val age = x._1
  val rows = x._2
  val numMarried = rows.count(_.isMarried)
  val numNotMarried = rows.length - numMarried
  (age, numMarried, numNotMarried)
}.foldRight(List[FinalResult]()) { (row, list) =>
  // foldRight visits the highest age first, so list.last is the next-higher age
  val cumMarried = row._2 +
    (if (list.isEmpty) 0 else list.last.cumMarried)
  val cumNotMarried = row._3 +
    (if (list.isEmpty) 0 else list.last.cumNotMarried)
  list :+ new FinalResult(row._1, row._2, row._3, cumMarried, cumNotMarried)
}.reverse
I don't like the above code: it's inefficient and hard to read, and I'm sure there is a better way.
How do I groupBy "both", and how do I do a count for each subgroup, e.g.
How many people are exactly 30 years old and married?
Another question is how to do a running total, to answer the question:
How many people above 30 are married?
Edit:
Thank you for all the great answers.
Just to clarify, I would like the output to include a "table" with the following columns
The goal is not only to answer those specific queries, but to produce a report that allows answering all questions of that type.
You can group by both fields at once, using a tuple as the key:
val groups = people.groupBy(p => (p.age, p.isMarried))
and then
val thirty_and_married = groups.getOrElse((30, true), Nil) // the married 30-year-olds (empty if there are none)
val over_thirty_and_married_count =
groups.filterKeys(k => k._1 > 30 && k._2).map(_._2.length).sum
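To make the tuple-key idea concrete, here is a minimal self-contained sketch. The sample data and the names `GroupByBoth`, `counts`, `thirtyAndMarried` and `overThirtyAndMarried` are made up for illustration; only `Person` comes from the question.

```scala
object GroupByBoth {
  case class Person(age: Int, isMarried: Boolean)

  // Hypothetical sample data, for illustration only
  val people: List[Person] = List(
    Person(30, true), Person(30, true), Person(30, false),
    Person(35, true), Person(41, false), Person(41, true)
  )

  // Build the (age, isMarried) -> count table from a single groupBy
  val counts: Map[(Int, Boolean), Int] =
    people.groupBy(p => (p.age, p.isMarried)).map { case (key, group) => key -> group.size }

  // Exactly 30 and married; a missing cell counts as 0
  val thirtyAndMarried: Int = counts.getOrElse((30, true), 0)

  // Older than 30 and married: sum the matching cells
  val overThirtyAndMarried: Int =
    counts.collect { case ((age, true), n) if age > 30 => n }.sum
}
```

The sketch uses `map` and `collect` rather than `mapValues` and `filterKeys` because the latter two are deprecated on strict maps in Scala 2.13 (in favour of `.view.mapValues` and `.view.filterKeys`); on older versions either form should work.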
Here is an option that is a little more verbose, but works generically on string rows instead of using strict data types. You could of course use generics to make this nicer, but I think you get the idea.
/** Creates a new pivot structure by finding correlated values
* and performing an operation on these values
*
* @param accuOp the accumulator function (e.g. sum, max, etc)
* @param xCol the "x" axis column
* @param yCol the "y" axis column
* @param accuCol the column to collect and perform accuOp on
* @return a new Pivot instance that has been transformed with the accuOp function
*/
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
// create list of indexes that correlate to x, y, accuCol
val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))
// group by x and y, sending the resulting collection of
// accumulated values to the accuOp function for post-processing
val data = body.groupBy(row => {
(row(colsIdx(0)), row(colsIdx(1)))
}).map(g => {
(g._1, accuOp(g._2.map(_(colsIdx(2)))))
}).toMap
// get distinct axis values
val xAxis = data.map(g => {g._1._1}).toList.distinct
val yAxis = data.map(g => {g._1._2}).toList.distinct
// create result matrix
val newRows = yAxis.map(y => {
xAxis.map(x => {
data.getOrElse((x,y), "")
})
})
// collect it with axis labels for results
Pivot(List((yCol + "/" + xCol) +: xAxis) :::
newRows.zip(yAxis).map(x=> {x._2 +: x._1}))
}
My Pivot type is pretty basic (a case class, so that Pivot(...) can be called without new):
case class Pivot(rows: List[List[String]]) {
val headers = rows.head.zipWithIndex.toMap
val body = rows.tail
...
}
And to test it, you could do something like this:
val marriedP = Pivot(
List(
List("Name", "Age", "Married"),
List("Bill", "42", "TRUE"),
List("Heloise", "47", "TRUE"),
List("Thelma", "34", "FALSE"),
List("Bridget", "47", "TRUE"),
List("Robert", "42", "FALSE"),
List("Eddie", "42", "TRUE")
)
)
// counts the values collected for a cell
def accum(values: List[String]) = {
values.size.toString
}
println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))
Which yields:
Name     Age  Married
Bill     42   TRUE
Heloise  47   TRUE
Thelma   34   FALSE
Bridget  47   TRUE
Robert   42   FALSE
Eddie    42   TRUE

Married/Age  47  42  34
TRUE         2   2
FALSE            1   1
The nice thing is that, because doPivot is curried, you can pass in any aggregation function for the values, as you would in a traditional Excel pivot table.
More can be found here: https://github.com/vinsonizer/pivotfun
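As the answer notes, generics could make this nicer. A rough sketch of what that might look like follows; the `pivot` helper, the `TypedPivot` object and the `Person` class are hypothetical and not taken from the linked repository. The data mirrors the test rows above.

```scala
object TypedPivot {
  /** Groups items by a (row, column) key and aggregates the items in each cell. */
  def pivot[A, R, C, V](items: Seq[A])(rowOf: A => R, colOf: A => C)(agg: Seq[A] => V): Map[(R, C), V] =
    items.groupBy(a => (rowOf(a), colOf(a))).map { case (key, cell) => key -> agg(cell) }

  // Hypothetical typed record standing in for the string rows
  case class Person(name: String, age: Int, married: Boolean)

  val people: Seq[Person] = Seq(
    Person("Bill", 42, true), Person("Heloise", 47, true), Person("Thelma", 34, false),
    Person("Bridget", 47, true), Person("Robert", 42, false), Person("Eddie", 42, true)
  )

  // Count per (age, married) cell; any other aggregation could be passed instead of _.size
  val counts: Map[(Int, Boolean), Int] = pivot(people)(_.age, _.married)(_.size)
}
```

Cells with no data are simply absent from the map, so a lookup such as `counts.getOrElse((34, true), 0)` plays the role of the empty string in the string-based version.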
I think it would be better to use the count method on Lists directly
For question 1
people.count { p => p.age == 30 && p.isMarried }
For question 2
people.count { p => p.age > 30 && p.isMarried }
If you also want the actual groups of people who satisfy those predicates, use filter.
people.filter { p => p.age > 30 && p.isMarried }
You could probably optimise these by doing the traversal only once, but is that a requirement?
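If a single traversal did matter, one possible sketch is a foldLeft that carries both counters at once. The `SinglePass` object and the `marriedCounts` name are invented for this example.

```scala
object SinglePass {
  case class Person(age: Int, isMarried: Boolean)

  /** Returns (married and exactly 30, married and older than 30) in one pass over the list. */
  def marriedCounts(people: List[Person]): (Int, Int) =
    people.foldLeft((0, 0)) { case ((exactly, over), p) =>
      if (!p.isMarried) (exactly, over)            // unmarried people affect neither counter
      else if (p.age == 30) (exactly + 1, over)
      else if (p.age > 30) (exactly, over + 1)
      else (exactly, over)                         // married but younger than 30
    }
}
```

For two questions the gain over two `count` calls is probably negligible; this only starts to pay off when many such counters are needed over a large list.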
You can group using a tuple:
val res1 = people.groupBy(p => (p.age, p.isMarried)) //or
val res2 = people.groupBy(p => (p.age, p.isMarried)).mapValues(_.size) //if you don't care about the Person instances
You can answer both questions like this:
res2.getOrElse((30, true), 0)
res2.filter{case (k, _) => k._1 > 30 && k._2}.values.sum
res2.filterKeys(k => k._1 > 30 && k._2).values.sum // nicer with filterKeys from Rex Kerr's answer
You could answer both questions with a method count on List:
people.count(p => p.age == 30 && p.isMarried)
people.count(p => p.age > 30 && p.isMarried)
Or using filter and size:
people.filter(p => p.age == 30 && p.isMarried).size
people.filter(p => p.age > 30 && p.isMarried).size
Edit: a slightly cleaner version of your code:
TreeMap(peopleByAge.toSeq: _*).map {case (age, ps) =>
val (married, notMarried) = ps.partition(_.isMarried) // partition, not span: span stops at the first non-match
(age, married.size, notMarried.size)
}.foldLeft(List[FinalResult]()) { case (acc, (age, married, notMarried)) =>
def prevValue(f: (FinalResult) => Int) = acc.headOption.map(f).getOrElse(0)
new FinalResult(age, married, notMarried, prevValue(_.cumMarried) + married, prevValue(_.cumNotMarried) + notMarried) :: acc
}.reverse
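For the full report with running totals, another possible sketch uses scanRight instead of a hand-written fold. `ReportRow` is a hypothetical stand-in for the question's `FinalResult` (whose definition is not shown), and `PivotReport` and `report` are invented names. The totals here accumulate from the oldest age downwards, as in the question's foldRight, so each row's cumulative columns cover that age and everything above it.

```scala
object PivotReport {
  case class Person(age: Int, isMarried: Boolean)

  // Hypothetical row type: per-age counts plus running totals for this age and older
  case class ReportRow(age: Int, married: Int, notMarried: Int, cumMarried: Int, cumNotMarried: Int)

  def report(people: List[Person]): List[ReportRow] = {
    // Per-age counts, sorted by ascending age
    val perAge: List[(Int, Int, Int)] =
      people.groupBy(_.age).toList.sortBy(_._1).map { case (age, ps) =>
        val married = ps.count(_.isMarried)
        (age, married, ps.size - married)
      }
    // scanRight sums from the right; init drops the trailing seed value 0
    val cumMarried = perAge.map(_._2).scanRight(0)(_ + _).init
    val cumNotMarried = perAge.map(_._3).scanRight(0)(_ + _).init

    perAge.zip(cumMarried.zip(cumNotMarried)).map { case ((age, m, n), (cm, cn)) =>
      ReportRow(age, m, n, cm, cn)
    }
  }
}
```

With such a table, "how many people above 30 are married" is the row for age 30's `cumMarried` minus its `married` (or simply the `cumMarried` of the next row). Using scanLeft instead would give totals accumulated from the youngest age upwards.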