Using aggregation and group by on the Spark Dataset API

Question

    JavaRDD<Person> prdd = sc.textFile("c:\\fls\\people.txt").map(
        new Function<String, Person>() {
            public Person call(String line) throws Exception {
                String[] parts = line.split(",");
                Person person = new Person();
                person.setName(parts[0]);
                person.setAge(Integer.parseInt(parts[1].trim()));
                person.setSal(Integer.parseInt(parts[2].trim()));
                return person;
            }
        });

    RDD<Person> personRDD = JavaRDD.toRDD(prdd);
    Dataset<Person> dss = sqlContext.createDataset(personRDD, Encoders.bean(Person.class));
    GroupedDataset<Row, Person> dq = dss.groupBy(new Column("name"));

I need to calculate the sum of age and the sum of salary, grouped by name, on this dataset. How can I query the dataset to do that? I tried using GroupedDataset but don't know how to proceed with it. Thanks.

Answer 1

You can register the JavaRDD prdd as a table and then use it in SQL statements:

    DataFrame schemaPeople = sqlContext.createDataFrame(prdd, Person.class);
    schemaPeople.registerTempTable("people");

    // SQL can be run over RDDs that have been registered as tables.
    // The salary column is called sal here, assuming the Person bean
    // exposes it through getSal/setSal as in the question.
    DataFrame totals = sqlContext.sql(
        "SELECT name, sum(age), sum(sal) FROM people GROUP BY name");

    // The results of SQL queries are DataFrames and support all the normal RDD operations.

Read more: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically
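
To sanity-check what a group-by-name query should return on a small sample, the same computation can be sketched in plain Java without Spark. This is only an illustration of the expected semantics, not Spark API: the class `GroupSumSketch`, the method `sumByName`, the stand-in `Person` class, and the sample rows are all hypothetical names and data made up for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupSumSketch {

    // Hypothetical stand-in for the Person bean from the question (name, age, sal).
    static final class Person {
        final String name;
        final int age;
        final int sal;

        Person(String name, int age, int sal) {
            this.name = name;
            this.age = age;
            this.sal = sal;
        }
    }

    // Groups people by name and returns name -> {sum of age, sum of sal},
    // which is what "sum(age), sum(sal) ... GROUP BY name" computes per group.
    static Map<String, long[]> sumByName(List<Person> people) {
        Map<String, long[]> totals = new TreeMap<>(); // TreeMap keeps names sorted
        for (Person p : people) {
            long[] t = totals.computeIfAbsent(p.name, k -> new long[2]);
            t[0] += p.age;
            t[1] += p.sal;
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample rows, in the same name,age,sal layout as people.txt.
        List<Person> sample = Arrays.asList(
            new Person("Ann", 30, 1000),
            new Person("Bob", 25, 800),
            new Person("Ann", 40, 1500));

        // Print one line per name: name, total age, total sal.
        sumByName(sample).forEach((name, t) ->
            System.out.println(name + " " + t[0] + " " + t[1]));
    }
}
```

Running the same rows through the Spark query should give one row per distinct name with matching totals, which makes this a quick way to check a column-name or grouping mistake.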

Answered 2016-04-24 16:21:26
