
How to parallelize this in Spark using the Spark Dataset API

I am using spark-sql-2.4.1v with Java 8.

I have data as below:

val df_data = Seq(
  ("Indus_1","Indus_1_Name","Country1", "State1",12789979),
  ("Indus_2","Indus_2_Name","Country1", "State2",21789933),
  ("Indus_3","Indus_3_Name","Country1", "State3",21789978),
  ("Indus_4","Indus_4_Name","Country2", "State1",41789978),
  ("Indus_5","Indus_5_Name","Country3", "State3",27789978),
  ("Indus_6","Indus_6_Name","Country1", "State1",27899790),
  ("Indus_7","Indus_7_Name","Country3", "State1",27899790),
  ("Indus_8","Indus_8_Name","Country1", "State2",27899790),
  ("Indus_9","Indus_9_Name","Country4", "State1",27899790)
  ).toDF("industry_id","industry_name","country","state","revenue");

Given the following inputs:

val countryList = Seq("Country1","Country2");
val stateMap = Map("Country1" -> Seq("State1","State2"), "Country2" -> Seq("State2","State3"));

In the Spark job, for each country and each of its states, I need to calculate the total revenue across several industries.

In other languages, we would use a for loop, i.e.:

for (country <- countryList) {
  for (state <- stateMap.getOrElse(country, Seq.empty)) {
    // do some calculation for each state's industries
  }
}
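The nested loop above can be flattened into explicit (country, state) pairs, which is the shape a Spark job would then distribute across executors. A minimal sketch with plain Scala collections (no Spark involved), assuming a Seq-valued stateMap:

```scala
// Flatten the nested country/state loop into (country, state) pairs.
val countryList = Seq("Country1", "Country2")
val stateMap = Map(
  "Country1" -> Seq("State1", "State2"),
  "Country2" -> Seq("State2", "State3")
)

// getOrElse skips countries with no entry in stateMap.
val pairs = for {
  country <- countryList
  state   <- stateMap.getOrElse(country, Seq.empty)
} yield (country, state)

println(pairs)
// List((Country1,State1), (Country1,State2), (Country2,State2), (Country2,State3))
```

Each pair then becomes an independent unit of work, which is exactly what the answers below parallelize.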

In Spark, I understand that we should not do it this way, since the work would not be distributed across the executors. So what is the correct way to handle this?

It really depends on what you want to do. If you don't need to share state between countries, then you should create a DataFrame where each row is a (country, state) pair; you can then control how many rows are processed in parallel (via the number of partitions and the number of cores).

You can use flatMapValues to create the key-value pairs and then do your calculation in a .map step.

scala> val data = Seq(("country1",Seq("state1","state2","state3")),("country2",Seq("state1","state2")))
scala> val rdd = sc.parallelize(data)
scala> val rdd2 = rdd.flatMapValues(s=>s)

scala> rdd2.foreach(println(_))
(country1,state1)
(country2,state1)
(country1,state2)
(country2,state2)
(country1,state3)

Here you can perform your operations; I appended a # to each state:

scala> rdd2.map(s=>(s._1,s._2+"#")).foreach(println(_))
(country1,state1#)
(country1,state2#)
(country1,state3#)
(country2,state1#)
(country2,state2#)

I added a few extra rows to your sample data to make the aggregations distinguishable. I used a Scala parallel collection: for each country it looks up the states, filters the given dataframe with those values, performs the aggregation, and finally unions all the results.

scala> val df = Seq(
     |   ("Indus_1","Indus_1_Name","Country1", "State1",12789979),
     |   ("Indus_2","Indus_2_Name","Country1", "State2",21789933),
     |   ("Indus_2","Indus_2_Name","Country1", "State2",31789933),
     |   ("Indus_3","Indus_3_Name","Country1", "State3",21789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State1",41789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State2",41789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State2",81789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State3",41789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State3",51789978),
     |   ("Indus_5","Indus_5_Name","Country3", "State3",27789978),
     |   ("Indus_6","Indus_6_Name","Country1", "State1",27899790),
     |   ("Indus_7","Indus_7_Name","Country3", "State1",27899790),
     |   ("Indus_8","Indus_8_Name","Country1", "State2",27899790),
     |   ("Indus_9","Indus_9_Name","Country4", "State1",27899790)
     |   ).toDF("industry_id","industry_name","country","state","revenue")
df: org.apache.spark.sql.DataFrame = [industry_id: string, industry_name: string ... 3 more fields]

scala> val countryList = Seq("Country1","Country2","Country4","Country5");
countryList: Seq[String] = List(Country1, Country2, Country4, Country5)

scala> val stateMap = Map("Country1" -> ("State1","State2"), "Country2" -> ("State2","State3"),"Country3" -> ("State31","State32"));
stateMap: scala.collection.immutable.Map[String,(String, String)] = Map(Country1 -> (State1,State2), Country2 -> (State2,State3), Country3 -> (State31,State32))
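Note that a tuple-valued stateMap fixes exactly two states per country, which is why the filter below has to spell out data._2._1 and data._2._2. A hedged sketch (plain Scala, no Spark) of a Seq-valued map that handles any number of states; in the Spark filter you could then use `$"state".isin(states: _*)` instead of two equality checks:

```scala
// Seq values instead of a fixed 2-tuple: each country can carry
// any number of states.
val stateMapSeq = Map(
  "Country1" -> Seq("State1", "State2"),
  "Country2" -> Seq("State2", "State3", "State4")
)

// Membership check that generalizes the two hard-coded comparisons.
val states = stateMapSeq.getOrElse("Country2", Seq.empty)
println(states.contains("State4")) // true
```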

scala>

scala> :paste
// Entering paste mode (ctrl-D to finish)

countryList
.par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map{data =>
    df.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
}.reduce(_ union _).show(false)


// Exiting paste mode, now interpreting.

+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790     |
|Country1|State1|Indus_6_Name |27899790     |
|Country1|State2|Indus_2_Name |53579866     |
|Country1|State1|Indus_1_Name |12789979     |
|Country2|State3|Indus_4_Name |93579956     |
|Country2|State2|Indus_4_Name |123579956    |
+--------+------+-------------+-------------+


scala>
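For comparison, the same totals can be computed in a single pass by filtering on the set of allowed (country, state) pairs instead of issuing one filtered query per country. A plain-Scala sketch of the logic (the Row case class and the figures mirror the sample data above; in Spark this would be one filter plus one groupBy over the full DataFrame):

```scala
// Minimal stand-in for the DataFrame rows.
case class Row(industryId: String, industryName: String,
               country: String, state: String, revenue: Long)

val rows = Seq(
  Row("Indus_1", "Indus_1_Name", "Country1", "State1", 12789979L),
  Row("Indus_2", "Indus_2_Name", "Country1", "State2", 21789933L),
  Row("Indus_2", "Indus_2_Name", "Country1", "State2", 31789933L),
  Row("Indus_4", "Indus_4_Name", "Country2", "State2", 41789978L),
  Row("Indus_4", "Indus_4_Name", "Country2", "State2", 81789978L)
)

// Allowed (country, state) pairs, mirroring countryList + stateMap.
val allowed = Set(
  ("Country1", "State1"), ("Country1", "State2"),
  ("Country2", "State2"), ("Country2", "State3")
)

// Filter once, group once, sum revenue per (country, state, industry).
val totals = rows
  .filter(r => allowed.contains((r.country, r.state)))
  .groupBy(r => (r.country, r.state, r.industryName))
  .map { case (key, rs) => key -> rs.map(_.revenue).sum }

println(totals(("Country1", "State2", "Indus_2_Name"))) // 53579866
```

This avoids launching one Spark job per country, at the cost of losing the per-country isolation the parallel-collection version gives you.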

Edit - 1: Moved the aggregation code into a separate function.

scala> def processDF(data:(String,(String,String)),adf:DataFrame) = adf.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
processDF: (data: (String, (String, String)), adf: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame

scala> :paste
// Entering paste mode (ctrl-D to finish)

countryList.
par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map(data => processDF(data,df))
.reduce(_ union _)
.show(false)


// Exiting paste mode, now interpreting.

+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790     |
|Country1|State1|Indus_6_Name |27899790     |
|Country1|State2|Indus_2_Name |53579866     |
|Country1|State1|Indus_1_Name |12789979     |
|Country2|State3|Indus_4_Name |93579956     |
|Country2|State2|Indus_4_Name |123579956    |
+--------+------+-------------+-------------+


scala>
