SPARK: How to create aggregate from RDD[Row] in Scala
How can I create a List/Map in an RDD/DataFrame so that I can get aggregates?
I have a file in which each line is a JSON object:
{
  itemId: 1122334,
  language: [
    {
      name: ["US", "FR"],
      value: ["english", "french"]
    },
    {
      name: ["IND"],
      value: ["hindi"]
    }
  ],
  country: [
    {
      US: [
        {
          startTime: 2016-06-06T17:39:35.000Z,
          endTime: 2016-07-28T07:00:00.000Z
        }
      ],
      CANADA: [
        {
          startTime: 2016-06-06T17:39:35.000Z,
          endTime: 2016-07-28T07:00:00.000Z
        }
      ],
      DENMARK: [
        {
          startTime: 2016-06-06T17:39:35.000Z,
          endTime: 2016-07-28T07:00:00.000Z
        }
      ],
      FRANCE: [
        {
          startTime: 2016-08-06T17:39:35.000Z,
          endTime: 2016-07-28T07:00:00.000Z
        }
      ]
    }
  ]
},
{
  itemId: 1122334,
  language: [
    {
      name: ["US", "FR"],
      value: ["english", "french"]
    },
    {
      name: ["IND"],
      value: ["hindi"]
    }
  ],
  country: [
    {
      US: [
        {
          startTime: 2016-06-06T17:39:35.000Z,
          endTime: 2016-07-28T07:00:00.000Z
        }
      ],
      CANADA: [
        {
          startTime: 2016-07-06T17:39:35.000Z,
          endTime: 2016-07-28T07:00:00.000Z
        }
      ],
      DENMARK: [
        {
          startTime: 2016-06-06T17:39:35.000Z,
          endTime: 2016-07-28T07:00:00.000Z
        }
      ],
      FRANCE: [
        {
          startTime: 2016-08-06T17:39:35.000Z,
          endTime: 2016-07-28T07:00:00.000Z
        }
      ]
    }
  ]
}
I have a matching POJO that holds the values read from the JSON.
import com.mapping.data.model.MappingUtils
import com.mapping.data.model.CountryInfo
val mappingPath = "s3://.../"
val timeStamp = "2016-06-06T17:39:35.000Z"
val endTimeStamp = "2016-06-07T17:39:35.000Z"
val COUNTRY_US = "US"
val COUNTRY_CANADA = "CANADA"
val COUNTRY_DENMARK = "DENMARK"
val COUNTRY_FRANCE = "FRANCE"
val input = sc.textFile(mappingPath)
The input is a list of JSON objects, one per line. I map each line to the POJO class CountryInfo using MappingUtils, which takes care of the JSON parsing and conversion:
val MappingsList = input.map(x => {
  val countryInfo = MappingUtils.getCountryInfoString(x)
  (countryInfo.getItemId(), countryInfo)
}).collectAsMap
MappingsList: scala.collection.Map[String,com.mapping.data.model.CountryInfo]
def showCountryInfo(x: Option[CountryInfo]) = x match {
  case Some(s) => s
}
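But I need to create a DataFrame/RDD so that I can get the country and language aggregates per itemId.
In the given example, if a country's start time is not earlier than "2016-06-07T17:39:35.000Z", its value will be zero.
Which format is better for building the final aggregated JSON:
1. List?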
|-----itemId-----|-----country-----------|-----language--------------|
| 1122334        | [US, CANADA, DENMARK] | [english, hindi, french]  |
| 1122334        | [US, DENMARK]         | [english]                 |
|----------------|-----------------------|---------------------------|
2. Map?
|-----itemId-----|-----country--------------------------------|-----language----------------------|
| 1122334        | (US,2) (CANADA,1) (DENMARK,2) (FRANCE,0)   | (english,2) (hindi,1) (french,1)  |
| ....           |
| ....           |
| ....           |
|----------------|--------------------------------------------|-----------------------------------|
I want to create a final JSON with aggregated values like this:
{
  itemId: "1122334",
  country: {
    "US" : 2,
    "CANADA" : 1,
    "DENMARK" : 2,
    "FRANCE" : 0
  },
  language: {
    "english" : 2,
    "french" : 1,
    "hindi" : 1
  }
}
I tried a List:
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row

val events = sqlContext.sql("select itemId from EventList")
val itemList = events.map(row => {
  val itemId = row.getAs[String]("itemId")
  val countryInfo = showCountryInfo(MappingsList.get(itemId))
  val country = new ListBuffer[String]()
  // Keep a country only if its first start time is earlier than endTimeStamp.
  if (countryInfo.getCountry().getUS().get(0).getStartTime() < endTimeStamp) country += COUNTRY_US
  if (countryInfo.getCountry().getCANADA().get(0).getStartTime() < endTimeStamp) country += COUNTRY_CANADA
  if (countryInfo.getCountry().getDENMARK().get(0).getStartTime() < endTimeStamp) country += COUNTRY_DENMARK
  if (countryInfo.getCountry().getFRANCE().get(0).getStartTime() < endTimeStamp) country += COUNTRY_FRANCE
  val languageList = new ListBuffer[String]()
  // Assumption: getLanguages() returns a java.util.List, hence asScala.
  countryInfo.getLanguages().asScala.foreach(x => languageList += x.getValue())
  Row(itemId, country.toList, languageList.toList)
})
And a Map:
import scala.collection.JavaConverters._

val itemList = events.map(row => {
  val itemId = row.getAs[String]("itemId")
  val countryInfo = showCountryInfo(MappingsList.get(itemId))
  // A mutable map is required because entries are added with +=.
  val country = scala.collection.mutable.Map[String, Int]()
  country += (COUNTRY_US -> (if (countryInfo.getCountry().getUS().get(0).getStartTime() < endTimeStamp) 1 else 0))
  country += (COUNTRY_CANADA -> (if (countryInfo.getCountry().getCANADA().get(0).getStartTime() < endTimeStamp) 1 else 0))
  country += (COUNTRY_DENMARK -> (if (countryInfo.getCountry().getDENMARK().get(0).getStartTime() < endTimeStamp) 1 else 0))
  country += (COUNTRY_FRANCE -> (if (countryInfo.getCountry().getFRANCE().get(0).getStartTime() < endTimeStamp) 1 else 0))
  val language = scala.collection.mutable.Map[String, Int]()
  // Assumption: getLanguages() returns a java.util.List, hence asScala.
  countryInfo.getLanguages().asScala.foreach(x => language += (x.getValue() -> 1))
  Row(itemId, country.toMap, language.toMap)
})
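But both of these freeze in Zeppelin. Is there a better way to get the aggregates as JSON? Which is better for building the final aggregate, a List or a Map?
It would be helpful if you restated your question in terms of Spark DataFrame/Dataset and Row. I understand that you ultimately want to work with JSON, but the details of JSON input/output are a separate concern.
The functions you are looking for are the Spark SQL aggregate functions (see the group of functions on that page). The functions collect_list and collect_set are related, but the function you need is not already implemented.
You can implement what I'll call count_by_value by deriving from org.apache.spark.sql.expressions.UserDefinedAggregateFunction. This will require some in-depth knowledge of how Spark SQL works.
Once count_by_value is implemented, you can use it like this: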
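To illustrate what the built-in functions can already do, here is a minimal sketch, assuming Spark 2.x or later and a DataFrame with a string column itemId and an array<string> column country that already holds only the qualifying countries. The object and method names (BuiltInAggregates, countsPerItem, distinctPerItem) are made up for this example. It yields one row per (itemId, country) pair rather than a single map per item, so it is an approximation of the desired result, not a replacement for it.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_set, explode}

// Hypothetical helper object; the column names "itemId" and "country" are assumptions.
object BuiltInAggregates {
  // One row per (itemId, country) with the number of occurrences in that group.
  def countsPerItem(df: DataFrame): DataFrame =
    df.select(col("itemId"), explode(col("country")).as("country"))
      .groupBy("itemId", "country")
      .count()

  // One row per itemId with the distinct countries seen for that item.
  def distinctPerItem(df: DataFrame): DataFrame =
    df.select(col("itemId"), explode(col("country")).as("country"))
      .groupBy("itemId")
      .agg(collect_set("country").as("countries"))
}
```

Countries that never qualify (FRANCE in the question) do not appear in this output at all; a zero entry for them would have to be added in a separate step.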
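A minimal sketch of such a function might look like the following. It is one possible shape, not a definitive implementation: the class name CountByValue is made up here, and it assumes the input column is an array<string> and that a map<string, bigint> is an acceptable result type. Note that UserDefinedAggregateFunction is deprecated as of Spark 3.0 in favour of Aggregator, so this applies mainly to Spark 1.5 through 2.x.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical UDAF: counts how often each string occurs across the
// array<string> values of a group and returns a map of value -> count.
class CountByValue extends UserDefinedAggregateFunction {
  override def inputSchema: StructType =
    StructType(StructField("values", ArrayType(StringType)) :: Nil)

  override def bufferSchema: StructType =
    StructType(StructField("counts", MapType(StringType, LongType)) :: Nil)

  override def dataType: DataType = MapType(StringType, LongType)

  override def deterministic: Boolean = true

  // Start every group with an empty count map.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Long]

  // Add one to the count of every non-null element of the incoming array.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val counts = buffer.getMap[String, Long](0).toMap
      buffer(0) = input.getSeq[String](0).filter(_ != null).foldLeft(counts) { (acc, v) =>
        acc.updated(v, acc.getOrElse(v, 0L) + 1L)
      }
    }

  // Combine two partial count maps by summing the counts per key.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val left = buffer1.getMap[String, Long](0).toMap
    buffer1(0) = buffer2.getMap[String, Long](0).foldLeft(left) { case (acc, (k, n)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + n)
    }
  }

  override def evaluate(buffer: Row): Any = buffer.getMap[String, Long](0)
}
```

With this class, count_by_value would simply be an instance, for example `val count_by_value = new CountByValue`.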
df.groupBy("itemId").agg(count_by_value(df("country")), count_by_value(df("language")))
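To make the intended result concrete, the semantics of count_by_value within a groupBy can be sketched with plain Scala collections, without Spark. The object and method names below are hypothetical, and the input is assumed to be (itemId, values) pairs where values already contains only the qualifying entries.

```scala
// Plain-Scala illustration of the aggregation, not Spark code.
object CountByValueSketch {
  // Counts how often each value occurs across all lists of one group.
  def countByValue(lists: Seq[Seq[String]]): Map[String, Int] =
    lists.flatten.groupBy(identity).map { case (value, hits) => value -> hits.size }

  // Groups (itemId, values) pairs by itemId and counts the values within each group.
  def aggregate(rows: Seq[(String, Seq[String])]): Map[String, Map[String, Int]] =
    rows.groupBy(_._1).map { case (itemId, group) =>
      itemId -> countByValue(group.map(_._2))
    }
}
```

For the two country lists from the question, [US, CANADA, DENMARK] and [US, DENMARK], this gives US -> 2, CANADA -> 1, DENMARK -> 2 for item 1122334. A value that never occurs, such as FRANCE, gets no entry, so a zero count would need to be filled in afterwards.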