Using aggregate and group by on the Spark Dataset API
JavaRDD<Person> prdd = sc.textFile("c:\\fls\\people.txt").map(
    new Function<String, Person>() {
        public Person call(String line) throws Exception {
            // Each line is expected to look like "name,age,sal"
            String[] parts = line.split(",");
            Person person = new Person();
            person.setName(parts[0]);
            person.setAge(Integer.parseInt(parts[1].trim()));
            person.setSal(Integer.parseInt(parts[2].trim()));
            return person;
        }
    });
// createDataset expects a Scala RDD, so unwrap the JavaRDD
RDD<Person> personRDD = prdd.rdd();
Dataset<Person> dss = sqlContext.createDataset(personRDD, Encoders.bean(Person.class));
GroupedDataset<Row, Person> dq = dss.groupBy(new Column("name"));
I have to calculate the sum of age and the sum of salary, grouped by name, on this Dataset. How can I query the Dataset? I tried using GroupedDataset but don't know how to proceed with it.
Thanks
You can register the JavaRDD prdd as a temporary table and then query it with SQL statements:
DataFrame schemaPeople = sqlContext.createDataFrame(prdd, Person.class);
schemaPeople.registerTempTable("people");
// SQL can be run over RDDs that have been registered as tables.
// The salary column is named "sal" here, assuming Person exposes getSal()/setSal() as in the question.
DataFrame totals = sqlContext.sql("SELECT name, sum(age), sum(sal) FROM people GROUP BY name");
// The results of SQL queries are DataFrames and support all the normal RDD operations.
Read more: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically
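To sanity-check what the GROUP BY query should return, the same aggregation can be sketched without Spark on a few sample lines. This is only an illustration, not part of the Spark API: the class name GroupSumSketch, the method sumByName, and the sample data are hypothetical, and the "name,age,sal" line layout is assumed from the parsing code in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: a Spark-free model of
// "SELECT name, sum(age), sum(sal) FROM people GROUP BY name".
class GroupSumSketch {

    // Parses "name,age,sal" lines and returns name -> {sum(age), sum(sal)}.
    static Map<String, long[]> sumByName(List<String> lines) {
        // TreeMap keeps the names sorted so the printed order is stable.
        Map<String, long[]> totals = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            String name = parts[0];
            long age = Integer.parseInt(parts[1].trim());
            long sal = Integer.parseInt(parts[2].trim());
            // One accumulator per name: index 0 holds the age sum, index 1 the sal sum.
            long[] sums = totals.computeIfAbsent(name, k -> new long[2]);
            sums[0] += age;
            sums[1] += sal;
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<String> sample = Arrays.asList(
            "Ann, 30, 1000",
            "Bob, 25, 2000",
            "Ann, 31, 1500");
        for (Map.Entry<String, long[]> e : sumByName(sample).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue()[0] + " " + e.getValue()[1]);
        }
    }
}
```

In Spark 1.6 the same aggregation can also be written without SQL through the DataFrame API, for example schemaPeople.groupBy("name").agg(sum("age"), sum("sal")) with org.apache.spark.sql.functions.sum statically imported. As above, the column name "sal" assumes the bean property implied by setSal.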