如何使用火花数据集 api 在火花中并行化

Question

我正在使用带有 Java 8 的 spark-sql-2.4.1v。

我有如下数据列

val df_data = Seq(
  ("Indus_1","Indus_1_Name","Country1", "State1",12789979),
  ("Indus_2","Indus_2_Name","Country1", "State2",21789933),
  ("Indus_3","Indus_3_Name","Country1", "State3",21789978),
  ("Indus_4","Indus_4_Name","Country2", "State1",41789978),
  ("Indus_5","Indus_5_Name","Country3", "State3",27789978),
  ("Indus_6","Indus_6_Name","Country1", "State1",27899790),
  ("Indus_7","Indus_7_Name","Country3", "State1",27899790),
  ("Indus_8","Indus_8_Name","Country1", "State2",27899790),
  ("Indus_9","Indus_9_Name","Country4", "State1",27899790)
  ).toDF("industry_id","industry_name","country","state","revenue");

鉴于以下输入列表：

val countryList = Seq("Country1","Country2");
val stateMap = Map("Country1" -> {"State1","State2"}, "Country2" -> {"State2","State3"});

在 Spark 工作中，对于每个 state 的每个国家/地区，我需要计算几个行业的总收入。

在其他语言中，我们使用 for 循环。

IE

for( country <- countryList ){
   for( state <- stateMap.get(country){
   // do some calculation for each state industries
   }
}

在火花中，我理解我们应该这样做，即所有执行者都没有被这样做。 那么处理这个问题的正确方法是什么？

Answer 1

这真的取决于你想做什么，如果你不需要在州（国家）之间共享 state，那么你应该创建你的 DataFrame 每一行是（国家，州）然后你可以控制多少行并行处理（num 个分区和 num 个内核）。

Answer 2

您可以使用flatMapValues创建键值对，然后在.map步骤中进行计算。

scala> val data = Seq(("country1",Seq("state1","state2","state3")),("country2",Seq("state1","state2")))
scala> val rdd = sc.parallelize(data)
scala> val rdd2 = rdd.flatMapValues(s=>s)

scala> rdd2.foreach(println(_))
(country1,state1)
(country2,state1)
(country1,state2)
(country2,state2)
(country1,state3)

在这里可以进行操作，我给每个state加了#

scala> rdd2.map(s=>(s._1,s._2+"#")).foreach(println(_))
(country1,state1#)
(country1,state2#)
(country1,state3#)
(country2,state1#)
(country2,state2#)

Answer 3

我在您的示例数据中添加了一些额外的行来区分聚合。 我使用了 scala 并行收集，对于每个国家/地区，它将获得状态，然后使用这些值过滤给定的 dataframe 然后进行聚合，最后它将加入所有结果。

scala> val df = Seq(
     |   ("Indus_1","Indus_1_Name","Country1", "State1",12789979),
     |   ("Indus_2","Indus_2_Name","Country1", "State2",21789933),
     |   ("Indus_2","Indus_2_Name","Country1", "State2",31789933),
     |   ("Indus_3","Indus_3_Name","Country1", "State3",21789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State1",41789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State2",41789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State2",81789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State3",41789978),
     |   ("Indus_4","Indus_4_Name","Country2", "State3",51789978),
     |   ("Indus_5","Indus_5_Name","Country3", "State3",27789978),
     |   ("Indus_6","Indus_6_Name","Country1", "State1",27899790),
     |   ("Indus_7","Indus_7_Name","Country3", "State1",27899790),
     |   ("Indus_8","Indus_8_Name","Country1", "State2",27899790),
     |   ("Indus_9","Indus_9_Name","Country4", "State1",27899790)
     |   ).toDF("industry_id","industry_name","country","state","revenue")
df: org.apache.spark.sql.DataFrame = [industry_id: string, industry_name: string ... 3 more fields]

scala> val countryList = Seq("Country1","Country2","Country4","Country5");
countryList: Seq[String] = List(Country1, Country2, Country4, Country5)

scala> val stateMap = Map("Country1" -> ("State1","State2"), "Country2" -> ("State2","State3"),"Country3" -> ("State31","State32"));
stateMap: scala.collection.immutable.Map[String,(String, String)] = Map(Country1 -> (State1,State2), Country2 -> (State2,State3), Country3 -> (State31,State32))

scala>

scala> :paste
// Entering paste mode (ctrl-D to finish)

countryList
.par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map{data =>
    df.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
}.reduce(_ union _).show(false)


// Exiting paste mode, now interpreting.

+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790     |
|Country1|State1|Indus_6_Name |27899790     |
|Country1|State2|Indus_2_Name |53579866     |
|Country1|State1|Indus_1_Name |12789979     |
|Country2|State3|Indus_4_Name |93579956     |
|Country2|State2|Indus_4_Name |123579956    |
+--------+------+-------------+-------------+


scala>

编辑 - 1：将 Agg 代码分隔到不同的 function 块中。

scala> def processDF(data:(String,(String,String)),adf:DataFrame) = adf.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
processDF: (data: (String, (String, String)), adf: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame

scala> :paste
// Entering paste mode (ctrl-D to finish)

countryList.
par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map(data => processDF(data,df))
.reduce(_ union _)
.show(false)


// Exiting paste mode, now interpreting.

+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790     |
|Country1|State1|Indus_6_Name |27899790     |
|Country1|State2|Indus_2_Name |53579866     |
|Country1|State1|Indus_1_Name |12789979     |
|Country2|State3|Indus_4_Name |93579956     |
|Country2|State2|Indus_4_Name |123579956    |
+--------+------+-------------+-------------+


scala>

如何使用火花数据集 api 在火花中并行化

问题描述

3 个解决方案

解决方案1
1 2020-04-15 10:41:17

解决方案2
1 2020-04-15 10:43:41

解决方案3
1 已采纳 2020-04-20 17:39:54

如何使用火花数据集 api 在火花中并行化

问题描述

3 个解决方案

解决方案1 1 2020-04-15 10:41:17

解决方案2 1 2020-04-15 10:43:41

解决方案3 1 已采纳 2020-04-20 17:39:54

解决方案1
1 2020-04-15 10:41:17

解决方案2
1 2020-04-15 10:43:41

解决方案3
1 已采纳 2020-04-20 17:39:54