
Parse CSV as DataFrame/DataSet with Apache Spark and Java

I am new to Spark, and I would like to use group-by & reduce to find the following from a CSV (one line per employee):

  Department, Designation, costToCompany, State
  Sales, Trainee, 12000, UP
  Sales, Lead, 32000, AP
  Sales, Lead, 32000, LA
  Sales, Lead, 32000, TN
  Sales, Lead, 32000, AP
  Sales, Lead, 32000, TN 
  Sales, Lead, 32000, LA
  Sales, Lead, 32000, LA
  Marketing, Associate, 18000, TN
  Marketing, Associate, 18000, TN
  HR, Manager, 58000, TN

I would like to simplify the above CSV with a group by on Department, Designation, State, with additional columns sum(costToCompany) and TotalEmployeeCount.

The result should look like:

  Dept, Desg, state, empCount, totalCost
  Sales,Lead,AP,2,64000
  Sales,Lead,LA,3,96000  
  Sales,Lead,TN,2,64000

Is there a way to achieve this using transformations and actions, or should we go for RDD operations?

Procedure

  • Create a class (schema) to encapsulate your structure (it is not required for approach B, but it will make your code easier to read if you use Java)

      public class Record implements Serializable {
          String department;
          String designation;
          long costToCompany;
          String state;
          // constructor, getters and setters
      }
  • Load the CSV (JSON) file

      JavaSparkContext sc;
      JavaRDD<String> data = sc.textFile("path/input.csv");

      //JavaSQLContext sqlContext = new JavaSQLContext(sc); // For previous versions
      SQLContext sqlContext = new SQLContext(sc); // In Spark 1.3 the Java API and Scala API have been unified

      JavaRDD<Record> rdd_records = data.map(
          new Function<String, Record>() {
              public Record call(String line) throws Exception {
                  // Here you could use JSON instead:
                  // Gson gson = new Gson();
                  // gson.fromJson(line, Record.class);
                  String[] fields = line.split(",");
                  Record sd = new Record(fields[0], fields[1],
                          Long.parseLong(fields[2].trim()), fields[3]);
                  return sd;
              }
          });

At this point you have two approaches:

A. SparkSQL

  • Register a table (using your defined schema class)

      JavaSchemaRDD table = sqlContext.applySchema(rdd_records, Record.class);
      table.registerAsTable("record_table");
      table.printSchema();
  • Query the table with the desired group-by query

      JavaSchemaRDD res = sqlContext.sql(
          "select department, designation, state, sum(costToCompany), count(*) "
          + "from record_table group by department, designation, state");
  • Here you can also perform any other queries you want, using a SQL approach, as sketched below.
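For instance, a minimal sketch of one such additional query, following the same pattern as the query above and reusing the registered record_table (the filter on state = 'TN' is only illustrative):

      JavaSchemaRDD tnOnly = sqlContext.sql(
          "select department, designation, sum(costToCompany), count(*) "
          + "from record_table where state = 'TN' "
          + "group by department, designation");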

B. Spark

  • Map to a pair RDD using a composite key: Department, Designation, State

      JavaPairRDD<String, Tuple2<Long, Integer>> records_JPRDD =
          rdd_records.mapToPair(new PairFunction<Record, String, Tuple2<Long, Integer>>() {
              public Tuple2<String, Tuple2<Long, Integer>> call(Record record) {
                  Tuple2<String, Tuple2<Long, Integer>> t2 = new Tuple2<String, Tuple2<Long, Integer>>(
                      record.department + record.designation + record.state,
                      new Tuple2<Long, Integer>(record.costToCompany, 1)
                  );
                  return t2;
              }
          });

  • reduceByKey on the composite key, summing the costToCompany column and accumulating the number of records per key

      JavaPairRDD<String, Tuple2<Long, Integer>> final_rdd_records =
          records_JPRDD.reduceByKey(new Function2<Tuple2<Long, Integer>, Tuple2<Long, Integer>,
                  Tuple2<Long, Integer>>() {
              public Tuple2<Long, Integer> call(Tuple2<Long, Integer> v1, Tuple2<Long, Integer> v2)
                      throws Exception {
                  return new Tuple2<Long, Integer>(v1._1 + v2._1, v1._2 + v2._2);
              }
          });
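As a hypothetical final step (not part of the original answer), the aggregated pair RDD can simply be written out as text; the output path is only illustrative:

      // Each element is rendered as (key, (totalCost, employeeCount)) by Tuple2.toString()
      final_rdd_records.saveAsTextFile("path/output");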

CSV files can be parsed with Spark's built-in CSV reader. It returns a DataFrame/DataSet on a successful read of the file, and on top of a DataFrame/DataSet you can easily apply SQL-like operations.

Using Spark 2.x (and above) with Java

Create a SparkSession object, a.k.a. spark

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL Example")
    .getOrCreate();

Create a schema for the rows with StructType

import org.apache.spark.sql.types.StructType;

StructType schema = new StructType()
    .add("department", "string")
    .add("designation", "string")
    .add("ctc", "long")
    .add("state", "string");

Create a dataframe from the CSV file and apply the schema to it

Dataset<Row> df = spark.read()
    .option("mode", "DROPMALFORMED")
    .schema(schema)
    .csv("hdfs://path/input.csv");

More options are available for reading data from a CSV file; a couple of common ones are sketched below.
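A minimal sketch of a few common reader options (header, delimiter and mode are standard options of Spark's CSV source; the values shown are only illustrative):

Dataset<Row> dfWithOptions = spark.read()
    .option("header", "true")        // treat the first line as column names and skip it
    .option("delimiter", ",")        // field separator; "," is the default
    .option("mode", "DROPMALFORMED") // silently drop lines that do not fit the schema
    .schema(schema)
    .csv("hdfs://path/input.csv");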

Now we can aggregate the data in two ways

1. SQL way

Register a table in the Spark SQL metastore to perform SQL operations

 df.createOrReplaceTempView("employee"); 

Run a SQL query on the registered dataframe

 Dataset<Row> sqlResult = spark.sql(
     "SELECT department, designation, state, SUM(ctc), COUNT(department)"
     + " FROM employee GROUP BY department, designation, state");
 sqlResult.show(); // for testing

We can even execute SQL directly on the CSV file, without creating a table with Spark SQL, as sketched below.
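A minimal sketch of that, assuming the same HDFS path as above and using Spark's "run SQL on files directly" syntax, where the csv. prefix addresses the file itself as a table:

Dataset<Row> direct = spark.sql(
    "SELECT * FROM csv.`hdfs://path/input.csv`");
direct.show(); // columns get generic names (_c0, _c1, ...) since no schema or header is applied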


2. Object chaining / programmatic / Java-like way

Do the necessary imports for the sql functions

 import static org.apache.spark.sql.functions.count;
 import static org.apache.spark.sql.functions.sum;

Use groupBy and agg on the dataframe/dataset to perform count and sum on the data

 Dataset<Row> dfResult = df.groupBy("department", "designation", "state")
     .agg(sum("ctc"), count("department"));
 // After Spark 1.6, columns mentioned in the group by are added to the result by default
 dfResult.show(); // for testing
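If you also want the aggregate columns to carry the names used in the question (totalCost, empCount) and the result to be written back out, here is a hedged sketch using column aliases and the DataFrame writer (the output path is only illustrative):

Dataset<Row> named = df.groupBy("department", "designation", "state")
    .agg(sum("ctc").as("totalCost"), count("department").as("empCount"));

named.write()
    .option("header", "true")
    .mode("overwrite")           // replace any existing output
    .csv("hdfs://path/output");  // writes part files under this directory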

Dependent libraries

"org.apache.spark" % "spark-core_2.11" % "2.0.0" 
"org.apache.spark" % "spark-sql_2.11" % "2.0.0"

The following may not be entirely correct, but it should give you an idea of how to juggle the data. It is not pretty and should be replaced with case classes etc., but as a quick example of how to use the Spark API, I hope it is enough :)

val rawlines = sc.textFile("hdfs://.../*.csv")

case class Employee(dep: String, des: String, cost: Double, state: String)

val employees = rawlines
  .map(_.split(",")) // or use a proper CSV parser
  .map(row => Employee(row(0), row(1), row(2).toDouble, row(3)))

// the 1 is the amount of employees (which is obviously 1 per line)
val keyVals = employees.map(em => ((em.dep, em.des, em.state), (1, em.cost)))

val results = keyVals.reduceByKey { (a, b) =>
  (a._1 + b._1, a._2 + b._2) // (a.count + b.count, a.cost + b.cost)
}

// debug output
results.take(100).foreach(println)

results
  .map(keyval => someThingToFormatAsCsvStringOrWhatever)
  .saveAsTextFile("hdfs://.../results")

Or you can use SparkSQL:

val sqlContext = new SQLContext(sparkContext)

// case classes can easily be registered as tables
import sqlContext.createSchemaRDD // implicit conversion from an RDD of case classes
employees.registerAsTable("employees")

val results = sqlContext.sql("""select dep, des, state, sum(cost), count(*)
  from employees
  group by dep, des, state""")

As for JSON, if your text file contains one JSON object per line, you can use sqlContext.jsonFile(path) to let Spark SQL load it as a SchemaRDD (the schema will be automatically inferred). Then you can register it as a table and query it with SQL. You can also manually load the text file as an RDD[String] containing one JSON object per record and use sqlContext.jsonRDD(rdd) to turn it into a SchemaRDD. jsonRDD is useful when you need to pre-process your data.
