
How to load data into Product case class using Dataframe in Spark

I have a text file that has data like below:

productId|price|saleEvent|rivalName|fetchTS 
123|78.73|Special|VistaCart.com|2017-05-11 15:39:30 
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 
123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29 
678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06 
678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22 
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 

I have to find the minimum price of each product across websites, e.g. my output should be like this:

productId|price|saleEvent|rivalName|fetchTS 
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 

I am trying this:

case class Product(productId:String, price:Double, saleEvent:String, rivalName:String, fetchTS:String)

val cDF = spark.read.text("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
val (header,values) = cDF.collect.splitAt(1)
values.foreach(x => Product(x(0).toString, x(1).toString.toDouble, 
x(2).toString, x(3).toString, x(4).toString))

I get an exception while running the last line:

 java.lang.ArrayIndexOutOfBoundsException: 1
   at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
   at org.apache.spark.sql.Row$class.apply(Row.scala:163)
   at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:166)
   at $anonfun$1.apply(<console>:28)
   at $anonfun$1.apply(<console>:28)
   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
   ... 49 elided

Printing the contents of values:

scala> values
res2: Array[org.apache.spark.sql.Row] = Array(
  [123|78.73|Special|VistaCart.com|2017-05-11 15:39:30 ],
  [123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 ],
  [123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29 ],
  [678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06 ],
  [678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22 ],
  [678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 ],
  [777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 ])
scala>

I understand that I need to split("|").

scala> val xy = values.foreach(x => x.toString.split("|").toSeq)
xy: Unit = ()

So after splitting, it gives me Unit (i.e. void), and I am unable to load the values into the Product case class. How can I load this DataFrame into the Product case class? I don't want to use Dataset for now, although Dataset is type safe.

I'm using Spark 2.3 and Scala 2.11.

The issue is due to split taking a regex, which means you need to use "\\|" instead of a single "|". Also, the foreach needs to be changed to map to actually give a return value, i.e.:

val xy = values.map(x => x.toString.split("\\|"))
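
If you stay with this approach, the split fields can then be mapped into the case class. A minimal sketch (not part of the original answer; it reads the single text column via getString(0) rather than toString, so the Row's surrounding brackets do not end up in the first field, and trims the trailing spaces visible in the sample rows):

// Sketch: build Product instances from the split fields.
// getString(0) reads the single "value" column produced by spark.read.text;
// trim removes the trailing spaces in the sample rows.
val products = values.map { row =>
  val f = row.getString(0).split("\\|").map(_.trim)
  Product(f(0), f(1).toDouble, f(2), f(3), f(4))
}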

However, a better approach would be to read the data as a CSV file with | as the separator. This way you do not need to treat the header in a special way, and by inferring the column types there is no need to make any conversions (here I changed fetchTS to a Timestamp):

import java.sql.Timestamp   // fetchTS is now a timestamp
import spark.implicits._    // provides the encoder needed by .as[Product] (auto-imported in spark-shell)

case class Product(productId: String, price: Double, saleEvent: String, rivalName: String, fetchTS: Timestamp)

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("sep", "|")
  .csv("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
  .as[Product]

The final line converts the DataFrame to a Dataset of the Product case class. If you want to use it as an RDD instead, simply add .rdd at the end.
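
For example, a quick way to check what was loaded, and to drop down to an RDD if needed (a sketch using standard Dataset methods; the variable name follows the snippet above):

df.printSchema()         // confirm the inferred column types, including fetchTS as timestamp
df.show(false)           // spot-check the parsed rows

val productRDD = df.rdd  // RDD[Product], only if an RDD is really required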

After this is done, use groupBy and agg to get the final results.
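
For example, one way to finish this (a sketch, not spelled out in the original answer: aggregate the minimum price per productId, then join back to recover the full matching rows):

import org.apache.spark.sql.functions.min

// Minimum price per productId, then join back to keep saleEvent, rivalName and fetchTS.
val minPrices = df.groupBy("productId").agg(min("price").as("price"))
val cheapest  = df.join(minPrices, Seq("productId", "price"))

cheapest.show(false)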
