I'm new to Scala and am trying to read raw input data and build a map by grouping on multiple fields.
Sample raw data:
date,uid,site,success
2014-07-14,userA,google,1
2014-07-14,userB,google,1
2014-07-14,userC,yahoo,1
2014-07-14,userD,facebook,1
I want to report the number of distinct users per site for each date, i.e.:
2014-07-14,google,2
2014-07-14,yahoo,1
2014-07-14,facebook,1
For this purpose, I'm trying to use groupBy on the date and site fields, with uid as the value. Once I have this data structure, I can iterate over the map and count the distinct values for each key. Can anyone point me to how to generate the data structure?
Thanks!
I hope I understood you correctly. Here is a full example.
case class Data(date: String, uid: String, site: String, success: Int)
val sampleData = List(
Data("2014-07-14","userA","google",1),
Data("2014-07-14","userA","google",1),
Data("2014-07-14","userB","google",1),
Data("2014-07-14","userC","yahoo",1),
Data("2014-07-14","userD","facebook",1)
)
sampleData.groupBy(_.date).map { case (date, datelist) =>
  (date, datelist.groupBy(_.site).map { case (site, sitelist) =>
    (site, sitelist.groupBy(_.uid).size)
  })
}
The output is: Map(2014-07-14 -> Map(google -> 2, yahoo -> 1, facebook -> 1))
Basically you get a Map for each date that contains, for each site, the number of distinct users who accessed it. Notice that the 2 accesses from userA count as 1, because sitelist.groupBy(_.uid).size counts the distinct uids.
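To get from the nested map to the `date,site,count` lines asked for in the question, the result can be flattened. This is only a minimal sketch: it assumes the `Data` case class above, the names `DistinctUsersReport` and `report` are made up for illustration, and the sorting is added just to make the output order deterministic.

```scala
object DistinctUsersReport {
  case class Data(date: String, uid: String, site: String, success: Int)

  // Count distinct uids per (date, site) and render "date,site,count" lines.
  def report(rows: Seq[Data]): Seq[String] = {
    val nested: Map[String, Map[String, Int]] =
      rows.groupBy(_.date).map { case (date, dateRows) =>
        date -> dateRows.groupBy(_.site).map { case (site, siteRows) =>
          // Distinct uids; gives the same count as siteRows.groupBy(_.uid).size
          site -> siteRows.map(_.uid).distinct.size
        }
      }
    // Flatten the nested map into report lines and sort them.
    val lines = for {
      (date, sites) <- nested.toSeq
      (site, count) <- sites.toSeq
    } yield s"$date,$site,$count"
    lines.sorted
  }
}
```

Calling `DistinctUsersReport.report` on rows equivalent to `sampleData` should yield one line per site for 2014-07-14, with google counted as 2.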
Edit: Yes, it is possible without an extra data structure. You just have to work with array indices instead of field names.
val fileText = """2014-07-14,userA,google,1
2014-07-14,userA,google,1
2014-07-14,userA,google,1
2014-07-14,userB,google,1
2014-07-14,userC,yahoo,1
2014-07-14,userD,facebook,1""".stripMargin
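// linesIterator is used because on JDK 11+ String.lines is the Java method returning a java.util.stream.Stream
fileText.linesIterator.map(_.split(",")).toList.groupBy(_(0)).map { case (date, datelist) =>
  (date, datelist.groupBy(_(2)).map { case (site, sitelist) =>
    (site, sitelist.groupBy(_(1)).size)
  })
}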
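Indexing into arrays works, but it is easy to mix up the positions. As a sketch, the raw lines could instead be parsed into the `Data` case class first. The names `ParseRawData`, `parseLine` and `parse` are hypothetical, and silently skipping lines that do not match (which also drops the `date,uid,site,success` header) is an assumption about the desired behaviour.

```scala
object ParseRawData {
  case class Data(date: String, uid: String, site: String, success: Int)

  // Parse one "date,uid,site,success" line; None if the line does not match.
  def parseLine(line: String): Option[Data] =
    line.split(",").map(_.trim) match {
      case Array(date, uid, site, success)
          if success.nonEmpty && success.forall(_.isDigit) =>
        Some(Data(date, uid, site, success.toInt))
      case _ => None // header, blank, or malformed line
    }

  // Parse a whole text, keeping only the lines that parsed successfully.
  def parse(text: String): List[Data] =
    text.linesIterator.flatMap(line => parseLine(line).toList).toList
}
```

The resulting `List[Data]` can then be fed to the `groupBy(_.date)` / `groupBy(_.site)` version shown earlier.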
Discarding the header line for clarity, a possible implementation is the following:
val text = """2014-07-14,userA,google,1
|2014-07-14,userA,google,1
|2014-07-14,userB,google,1
|2014-07-14,userC,yahoo,1
|2014-07-16,userC,yahoo,1
|2014-07-14,userD,facebook,1
|2014-07-14,userE,facebook,1
|""".stripMargin
val uniqueUsersByDateSite: Map[(String, String), Int] = text.linesIterator.map { line =>
  val tokens = line.split(",")
  (tokens(0), tokens(1), tokens(2))
}.toSet.groupBy { tuple: (String, String, String) =>
  (tuple._1, tuple._3)
}.mapValues(_.size).toMap // toMap materializes the result (mapValues returns a lazy view on Scala 2.13)
By creating a set of (date, uid, site) tuples, we collect one item for each unique user of a site on a specific date.
The groupBy method then groups by (date, site), turning the N items for the same date and site into one map entry whose value contains one item per unique user for that date and site.
The final mapValues call reduces each entry to its size, which gives the desired result:
Map((2014-07-16,yahoo) -> 1, (2014-07-14,facebook) -> 2, (2014-07-14,google) -> 2, (2014-07-14,yahoo) -> 1)
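If the intermediate structure from the question (the uids grouped by date and site) is wanted as well, the same idea can be written so that each key keeps its set of users. This is a sketch only: `UsersByDateSite`, `usersByDateSite` and `countsByDateSite` are names made up here, and the input is assumed to have the header line already removed.

```scala
object UsersByDateSite {
  // Map each (date, site) pair to the set of uids seen for it.
  def usersByDateSite(text: String): Map[(String, String), Set[String]] =
    text.linesIterator
      .map(_.split(","))
      // Keep only well-formed 4-column lines.
      .collect { case Array(date, uid, site, _) => ((date, site), uid) }
      .toList
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).toSet }

  // The distinct-user count is then the size of each set.
  def countsByDateSite(text: String): Map[(String, String), Int] =
    usersByDateSite(text).map { case (key, users) => key -> users.size }
}
```

Keeping the sets around costs more memory than counting directly, but it makes it possible to inspect which users were counted for a given date and site.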
The answer posted by @Kigyo seems pretty good, but I think you can extend it a little. So, assuming this data structure:
case class Data(date: String, uid: String, site: String, success: Int)
val sampleData = List(
Data("2014-07-14","userA","google",1),
Data("2014-07-14","userA","google",1),
Data("2014-07-14","userB","google",1),
Data("2014-07-14","userC","yahoo",1),
Data("2014-07-14","userD","facebook",1)
)
you can achieve what you want with:
sampleData.groupBy(d => (d.date, d.site)).map { case ((date, site), rows) => (date, site, rows.map(_.success).sum) }.toList
which returns a list of Tuple3 (date, site, total), just like you wanted.
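Note that summing `success` counts accesses rather than distinct users, so the two rows for userA are both counted. If distinct users are what is needed, as in the question, a variant along these lines might do. `DistinctUsersTuples` and `distinctUsers` are hypothetical names, `Data` is the case class from above, and the final sort is only there for a stable order.

```scala
object DistinctUsersTuples {
  case class Data(date: String, uid: String, site: String, success: Int)

  // One (date, site, distinctUserCount) tuple per (date, site) group.
  def distinctUsers(rows: List[Data]): List[(String, String, Int)] =
    rows
      .groupBy(d => (d.date, d.site))
      .map { case ((date, site), group) =>
        (date, site, group.map(_.uid).distinct.size)
      }
      .toList
      .sorted
}
```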