How to convert DataFrame to Dataset in Apache Spark in Java?
Parse CSV as DataFrame/DataSet with Apache Spark and Java
I am new to Spark, and I want to use group-by & reduce to find the following from a CSV (one line per employee):
Department, Designation, costToCompany, State
Sales, Trainee, 12000, UP
Sales, Lead, 32000, AP
Sales, Lead, 32000, LA
Sales, Lead, 32000, TN
Sales, Lead, 32000, AP
Sales, Lead, 32000, TN
Sales, Lead, 32000, LA
Sales, Lead, 32000, LA
Marketing, Associate, 18000, TN
Marketing, Associate, 18000, TN
HR, Manager, 58000, TN
I would like to simplify the CSV with a group by Department, Designation, State, adding the columns sum(costToCompany) and TotalEmployeeCount.
The result should look like:
Dept, Desg, state, empCount, totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000
Is there a way to achieve this using transformations and actions? Or should we go for RDD operations?
Create a class (schema) to encapsulate your structure (it is not required for approach B, but it will make your code easier to read if you are using Java):
public class Record implements Serializable {
    String department;
    String designation;
    long costToCompany;
    String state;
    // constructor, getters and setters
}
Load the CSV (or JSON) file:
JavaSparkContext sc;
JavaRDD<String> data = sc.textFile("path/input.csv");

//JavaSQLContext sqlContext = new JavaSQLContext(sc); // For previous versions
SQLContext sqlContext = new SQLContext(sc); // In Spark 1.3 the Java API and Scala API have been unified

JavaRDD<Record> rdd_records = data.map(
    new Function<String, Record>() {
        public Record call(String line) throws Exception {
            // Here you can use JSON
            // Gson gson = new Gson();
            // gson.fromJson(line, Record.class);
            String[] fields = line.split(",");
            return new Record(fields[0].trim(), fields[1].trim(),
                              Long.parseLong(fields[2].trim()), fields[3].trim());
        }
    });
At this point you have two approaches:
A. Register a table (using the Schema class you defined):
JavaSchemaRDD table = sqlContext.applySchema(rdd_records, Record.class);
table.registerAsTable("record_table");
table.printSchema();
Query the table with your desired group-by:
JavaSchemaRDD res = sqlContext.sql(
    "select department, designation, state, sum(costToCompany), count(*)"
    + " from record_table group by department, designation, state");
Here you can also run any other query you want, using the SQL approach.
B. Map with a composite key: Department, Designation, State
JavaPairRDD<String, Tuple2<Long, Integer>> records_JPRDD =
    rdd_records.mapToPair(new PairFunction<Record, String, Tuple2<Long, Integer>>() {
        public Tuple2<String, Tuple2<Long, Integer>> call(Record record) {
            return new Tuple2<String, Tuple2<Long, Integer>>(
                record.department + record.designation + record.state,
                new Tuple2<Long, Integer>(record.costToCompany, 1));
        }
    });
reduceByKey with the composite key, summing the costToCompany column and accumulating the number of records per key:
JavaPairRDD<String, Tuple2<Long, Integer>> final_rdd_records =
    records_JPRDD.reduceByKey(
        new Function2<Tuple2<Long, Integer>, Tuple2<Long, Integer>, Tuple2<Long, Integer>>() {
            public Tuple2<Long, Integer> call(Tuple2<Long, Integer> v1,
                                              Tuple2<Long, Integer> v2) throws Exception {
                return new Tuple2<Long, Integer>(v1._1 + v2._1, v1._2 + v2._2);
            }
        });
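Outside of Spark, the same composite-key aggregation can be sketched in plain Java: a HashMap plays the role of reduceByKey, and a long[] pair (cost sum, record count) stands in for Tuple2. The class name and sample rows below are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

public class CompositeKeyReduce {
    // Emulates mapToPair + reduceByKey: key -> {costToCompany sum, record count}
    public static Map<String, long[]> reduce(String[][] records) {
        Map<String, long[]> acc = new HashMap<>();
        for (String[] r : records) {
            // Composite key: department + designation + state
            String key = r[0] + r[1] + r[3];
            long cost = Long.parseLong(r[2]);
            acc.merge(key, new long[]{cost, 1},
                      (a, b) -> new long[]{a[0] + b[0], a[1] + b[1]});
        }
        return acc;
    }

    public static void main(String[] args) {
        String[][] sample = {
            {"Sales", "Lead", "32000", "AP"},
            {"Sales", "Lead", "32000", "AP"},
            {"HR", "Manager", "58000", "TN"}
        };
        long[] salesLeadAp = reduce(sample).get("SalesLeadAP");
        System.out.println(salesLeadAp[0] + "," + salesLeadAp[1]); // 64000,2
    }
}
```

Spark distributes the same merge function across partitions, which is why it must be associative, exactly like the two-argument lambda passed to Map.merge here.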
A CSV file can be parsed with Spark's built-in CSV reader. It returns a DataFrame/DataSet on a successful read, and on top of a DataFrame/DataSet you can easily apply SQL-like operations.
Create a SparkSession object (here called spark):
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL Example")
.getOrCreate();
Create a StructType schema for the rows:
import org.apache.spark.sql.types.StructType;
StructType schema = new StructType()
.add("department", "string")
.add("designation", "string")
.add("ctc", "long")
.add("state", "string");
Dataset<Row> df = spark.read()
.option("mode", "DROPMALFORMED")
.schema(schema)
.csv("hdfs://path/input.csv");
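With mode set to DROPMALFORMED, rows that do not fit the schema are silently dropped. The effect can be sketched in plain Java (hypothetical class and sample lines; Spark's actual parser handles quoting, escapes, and more):

```java
import java.util.ArrayList;
import java.util.List;

public class DropMalformed {
    // Keeps only rows matching the 4-column schema with a numeric "ctc" field,
    // roughly what mode=DROPMALFORMED does in Spark's CSV reader.
    public static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split(",");
            if (f.length != 4) continue;      // wrong column count -> drop
            try {
                Long.parseLong(f[2].trim());  // schema says ctc is long
            } catch (NumberFormatException e) {
                continue;                     // non-numeric ctc -> drop
            }
            rows.add(f);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "Sales, Lead, 32000, AP",
            "broken line",                   // dropped: only 1 column
            "HR, Manager, not-a-number, TN"  // dropped: ctc not numeric
        );
        System.out.println(parse(lines).size()); // 1
    }
}
```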
1. The SQL way
Register the dataset as a table in the Spark SQL metastore to perform SQL operations:
df.createOrReplaceTempView("employee");
Run an SQL query on the registered dataframe:
Dataset<Row> sqlResult = spark.sql(
    "SELECT department, designation, state, SUM(ctc), COUNT(department)"
    + " FROM employee GROUP BY department, designation, state");
sqlResult.show(); // for testing
2. The object-chaining / programmatic / Java-like way
Do the necessary imports for the sql functions:
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.sum;
Use groupBy and agg on the dataframe/dataset to perform count and sum on the data:
Dataset<Row> dfResult = df.groupBy("department", "designation", "state")
    .agg(sum("ctc"), count("department"));
// After Spark 1.6, columns mentioned in group by are added to the result by default
dfResult.show(); // for testing
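The same groupBy/agg shape exists in plain Java streams, which may help if you are new to the DataFrame API. A hypothetical Employee record grouped by (department, designation, state), with a cost sum and a count per group:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByAgg {
    public record Employee(String dept, String desg, long ctc, String state) {}

    // Mirrors df.groupBy("department","designation","state").agg(sum("ctc"), count(...)):
    // key -> {sum of ctc, employee count}
    public static Map<String, long[]> agg(List<Employee> emps) {
        return emps.stream().collect(Collectors.toMap(
            e -> e.dept() + "," + e.desg() + "," + e.state(),
            e -> new long[]{e.ctc(), 1},
            (a, b) -> new long[]{a[0] + b[0], a[1] + b[1]}));
    }

    public static void main(String[] args) {
        List<Employee> emps = List.of(
            new Employee("Sales", "Lead", 32000, "AP"),
            new Employee("Sales", "Lead", 32000, "AP"),
            new Employee("HR", "Manager", 58000, "TN"));
        long[] r = agg(emps).get("Sales,Lead,AP");
        System.out.println(r[0] + "," + r[1]); // 64000,2
    }
}
```

Unlike Spark, this runs on a single JVM, but the merge function has the same associative shape as the agg step.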
Dependencies:
"org.apache.spark" % "spark-core_2.11" % "2.0.0"
"org.apache.spark" % "spark-sql_2.11" % "2.0.0"
The following may not be entirely correct, but it should give you an idea of how to juggle the data. It's not pretty and should be replaced with case classes etc., but as a quick example of how to use the spark api, I hope it's enough :)
val rawlines = sc.textFile("hdfs://.../*.csv")

case class Employee(dep: String, des: String, cost: Double, state: String)

val employees = rawlines
  .map(_.split(",")) // or use a proper CSV parser
  .map(row => Employee(row(0), row(1), row(2).trim.toDouble, row(3)))

// the 1 is the amount of employees (which is obviously 1 per line)
val keyVals = employees.map(em => ((em.dep, em.des, em.state), (1, em.cost)))

val results = keyVals.reduceByKey { (a, b) =>
  (a._1 + b._1, a._2 + b._2) // (a.count + b.count, a.cost + b.cost)
}

// debug output
results.take(100).foreach(println)

results
  .map(keyval => someThingToFormatAsCsvStringOrWhatever)
  .saveAsTextFile("hdfs://.../results")
Or you can use SparkSQL:
val sqlContext = new SQLContext(sparkContext)

// case classes can easily be registered as tables
employees.registerAsTable("employees")

val results = sqlContext.sql("""select dep, des, state, sum(cost), count(*)
  from employees
  group by dep, des, state""")
For JSON, if your text file contains one JSON object per line, you can use sqlContext.jsonFile(path) to let Spark SQL load it as a SchemaRDD (the schema will be automatically inferred). Then you can register it as a table and query it with SQL. You can also manually load a text file as an RDD[String] containing one JSON object per record and use sqlContext.jsonRDD(rdd) to turn it into a SchemaRDD. jsonRDD is useful when you need to pre-process your data.