
Scala: how to create a map from raw data to compute distinct values

I'm new to Scala and am trying to read raw input data and build a map from it using groupBy on multiple fields.

Sample raw data:

date,uid,site,success
2014-07-14,userA,google,1
2014-07-14,userB,google,1
2014-07-14,userC,yahoo,1
2014-07-14,userD,facebook,1

I want to report the number of distinct users per site for each date, i.e.:

2014-07-14,google,2
2014-07-14,yahoo,1
2014-07-14,facebook,1

To do this, I'm trying to use groupBy on the date and site fields, with uid as the value. Once I have this data structure, I can iterate over the map and count the distinct values. Can anyone point me to how to build it?

Thanks!

I hope I understood you correctly. Here is a full example.

case class Data(date: String, uid: String, site: String, success: Int)

val sampleData = List(
  Data("2014-07-14","userA","google",1),
  Data("2014-07-14","userA","google",1),
  Data("2014-07-14","userB","google",1),
  Data("2014-07-14","userC","yahoo",1),
  Data("2014-07-14","userD","facebook",1)
)

sampleData.groupBy(_.date).map { case (date, datelist) =>
  date -> datelist.groupBy(_.site).map { case (site, sitelist) =>
    site -> sitelist.groupBy(_.uid).size
  }
}

The output is: Map(2014-07-14 -> Map(google -> 2, yahoo -> 1, facebook -> 1))

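You get a Map keyed by date; each value is a Map from site to the number of distinct users who accessed that site. Notice that the 2 accesses from userA count as 1.

 sitelist.groupBy(_.uid).size

counts the distinct uids: groupBy produces one key per uid, so the size of the resulting map is the number of distinct users.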
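To get from the raw lines to the flat "date,site,count" rows shown in the question, one option is to parse each line into the case class first and then flatten the grouped result. The following is only a sketch under some assumptions: the names DistinctUsersReport, parse and report are made up for illustration, the header and malformed rows are simply dropped, and the output is sorted so that the order is stable.

```scala
object DistinctUsersReport {
  final case class Data(date: String, uid: String, site: String, success: Int)

  // Parse one CSV line. Returns None for the header row or any row that
  // does not have exactly four fields with a numeric last field.
  def parse(line: String): Option[Data] =
    line.split(",").map(_.trim) match {
      case Array(date, uid, site, success)
          if success.nonEmpty && success.forall(_.isDigit) =>
        Some(Data(date, uid, site, success.toInt))
      case _ => None
    }

  // One "date,site,count" row per (date, site), counting distinct uids.
  def report(lines: Seq[String]): Seq[String] =
    lines.flatMap(parse)
      .groupBy(d => (d.date, d.site))
      .toSeq
      .map { case ((date, site), rows) =>
        (date, site, rows.map(_.uid).distinct.size)
      }
      .sorted // sort by date, then site, for a stable order
      .map { case (date, site, n) => s"$date,$site,$n" }
}
```

Calling DistinctUsersReport.report(rawLines).foreach(println) would then print one row per date and site.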

Edit: Yes, it is possible without an extra data structure. You just have to work with array indices instead of field names.

val fileText = """2014-07-14,userA,google,1
  |2014-07-14,userA,google,1
  |2014-07-14,userA,google,1
  |2014-07-14,userB,google,1
  |2014-07-14,userC,yahoo,1
  |2014-07-14,userD,facebook,1""".stripMargin // stripMargin needs the leading | on each line

// linesIterator avoids the clash with java.lang.String#lines on JDK 11+
fileText.linesIterator.map(_.split(",")).toList.groupBy(_(0)).map { case (date, datelist) =>
  date -> datelist.groupBy(_(2)).map { case (site, sitelist) =>
    site -> sitelist.groupBy(_(1)).size
  }
}

Discarding the header line for clarity, a possible implementation is the following:

val text = """2014-07-14,userA,google,1
            |2014-07-14,userA,google,1
            |2014-07-14,userB,google,1
            |2014-07-14,userC,yahoo,1
            |2014-07-16,userC,yahoo,1
            |2014-07-14,userD,facebook,1
            |2014-07-14,userE,facebook,1
            |""".stripMargin

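// linesIterator avoids the clash with java.lang.String#lines on JDK 11+
val uniqueUsersByDateSite: Map[(String, String), Int] = text.linesIterator.map {
  line =>
    val tokens = line.split(",")
    (tokens(0), tokens(1), tokens(2))
}.toSet.groupBy {
  tuple: (String, String, String) =>
    (tuple._1, tuple._3)
}.mapValues {
  _.size
}.toMap // mapValues returns a lazy view on Scala 2.13+; toMap turns it into a strict Map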

By creating a set of (date, uid, site) tuples, we keep exactly one item per unique user for a site on a given date.

The groupBy method then groups by (date, site), so the N items sharing a date and site become a single map entry whose value holds one item per unique user for that date and site.

The final mapValues call then produces the desired result:

Map((2014-07-16,yahoo) -> 1, (2014-07-14,facebook) -> 2, (2014-07-14,google) -> 2, (2014-07-14,yahoo) -> 1)
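If you also want to keep the intermediate structure the question asks about (the uids grouped under each date and site), a fold can build it in one pass. This is a minimal sketch, not part of the answer above: the names UidsBySite, uidsByDateSite and distinctCounts are hypothetical, and it assumes the header line has already been dropped and that every valid row has exactly four comma-separated fields.

```scala
object UidsBySite {
  // Map from (date, site) to the set of uids seen for that pair.
  // Rows that do not split into exactly four fields are skipped.
  def uidsByDateSite(lines: Iterator[String]): Map[(String, String), Set[String]] =
    lines
      .map(_.split(","))
      .collect { case Array(date, uid, site, _) => ((date, site), uid) }
      .foldLeft(Map.empty[(String, String), Set[String]]) {
        case (acc, (key, uid)) =>
          // Adding to a Set ignores duplicates, so repeated accesses collapse.
          acc.updated(key, acc.getOrElse(key, Set.empty[String]) + uid)
      }

  // Number of distinct uids per (date, site).
  def distinctCounts(lines: Iterator[String]): Map[(String, String), Int] =
    uidsByDateSite(lines).map { case (key, uids) => key -> uids.size }
}
```

With the sample above, UidsBySite.distinctCounts(text.linesIterator) should give the same counts, while uidsByDateSite keeps the actual user names available for further reporting.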

The answer posted by @Kigyo seems pretty good, but I think you can extend it a little. Assuming this data structure:

case class Data(date: String, uid: String, site: String, success: Int)
val sampleData = List(
  Data("2014-07-14","userA","google",1),
  Data("2014-07-14","userA","google",1),
  Data("2014-07-14","userB","google",1),
  Data("2014-07-14","userC","yahoo",1),
  Data("2014-07-14","userD","facebook",1)
)

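you can get a flat result with:

sampleData.groupBy(d => (d.date, d.site)).collect { case ((date, site), rows) => (date, site, rows.map(_.success).sum) }

which returns a collection of Tuple3 values (date, site, total). Note that this sums the success flags, so it counts successful accesses rather than distinct users.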
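Since summing success counts accesses while the question asks for distinct users, it may help to see both numbers side by side. The sketch below is an illustration only; AccessVsDistinct and summary are made-up names, and the result is sorted just to make the order predictable.

```scala
object AccessVsDistinct {
  final case class Data(date: String, uid: String, site: String, success: Int)

  // For each (date, site): (date, site, sum of success flags, distinct uids).
  def summary(rows: List[Data]): List[(String, String, Int, Int)] =
    rows.groupBy(d => (d.date, d.site)).toList.map {
      case ((date, site), group) =>
        (date, site, group.map(_.success).sum, group.map(_.uid).distinct.size)
    }.sorted
}
```

For the sample data, google has three successful accesses but only two distinct users, because userA appears twice.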


 