
How to load data into Product case class using Dataframe in Spark

I have a text file that has data like below:

productId|price|saleEvent|rivalName|fetchTS 
123|78.73|Special|VistaCart.com|2017-05-11 15:39:30 
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 
123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29 
678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06 
678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22 
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 

I have to find the minimum price of each product across websites, e.g. my output should be like this:

productId|price|saleEvent|rivalName|fetchTS 
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 

I am trying this:

case class Product(productId:String, price:Double, saleEvent:String, rivalName:String, fetchTS:String)

val cDF = spark.read.text("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
val (header,values) = cDF.collect.splitAt(1)
values.foreach(x => Product(x(0).toString, x(1).toString.toDouble, 
x(2).toString, x(3).toString, x(4).toString))

I get an exception while running the last line:

 java.lang.ArrayIndexOutOfBoundsException: 1
   at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
   at org.apache.spark.sql.Row$class.apply(Row.scala:163)
   at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:166)
   at $anonfun$1.apply(<console>:28)
   at $anonfun$1.apply(<console>:28)
   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
   ... 49 elided

Printing the contents of values:

scala> values
res2: Array[org.apache.spark.sql.Row] = Array(
  [123|78.73|Special|VistaCart.com|2017-05-11 15:39:30 ],
  [123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 ],
  [123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29 ],
  [678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06 ],
  [678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22 ],
  [678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 ],
  [777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 ])
scala>

I understand that I need to split("|").

scala> val xy = values.foreach(x => x.toString.split("|").toSeq)
xy: Unit = ()

So after splitting, it gives me Unit (i.e. void), and I am unable to load the values into the Product case class. How can I load this DataFrame into the Product case class? I don't want to use Dataset for now, although Dataset is type safe.

I'm using Spark 2.3 and Scala 2.11.

The issue is due to split taking a regex, which means you need to use "\\|" instead of a single "|". Also, the foreach needs to be changed to map to actually give a return value, i.e.:

val xy = values.map(x => x.toString.split("\\|"))
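
If you stay with this approach, the split fields can then be mapped into the case class. A minimal sketch (not part of the original answer; it reads the single text column via getString(0) rather than toString, so the Row's surrounding brackets do not end up in the first field, and trims the trailing spaces visible in the sample rows):

// Sketch: build Product instances from the split fields.
// getString(0) reads the single "value" column produced by spark.read.text;
// trim removes the trailing spaces in the sample rows.
val products = values.map { row =>
  val f = row.getString(0).split("\\|").map(_.trim)
  Product(f(0), f(1).toDouble, f(2), f(3), f(4))
}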

However, a better approach would be to read the data as a CSV file with | as the separator. This way you do not need to treat the header in a special way, and by inferring the column types there is no need to make any conversions (here I changed fetchTS to a Timestamp):

import java.sql.Timestamp   // fetchTS is now a timestamp
import spark.implicits._    // provides the encoder needed by .as[Product] (auto-imported in spark-shell)

case class Product(productId: String, price: Double, saleEvent: String, rivalName: String, fetchTS: Timestamp)

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("sep", "|")
  .csv("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
  .as[Product]

The final line converts the DataFrame to a Dataset of the Product case class. If you want to use it as an RDD instead, simply add .rdd at the end.
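
For example, a quick way to check what was loaded, and to drop down to an RDD if needed (a sketch using standard Dataset methods; the variable name follows the snippet above):

df.printSchema()         // confirm the inferred column types, including fetchTS as timestamp
df.show(false)           // spot-check the parsed rows

val productRDD = df.rdd  // RDD[Product], only if an RDD is really required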

After this is done, use groupBy and agg to get the final results.
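
For example, one way to finish this (a sketch, not spelled out in the original answer: aggregate the minimum price per productId, then join back to recover the full matching rows):

import org.apache.spark.sql.functions.min

// Minimum price per productId, then join back to keep saleEvent, rivalName and fetchTS.
val minPrices = df.groupBy("productId").agg(min("price").as("price"))
val cheapest  = df.join(minPrices, Seq("productId", "price"))

cheapest.show(false)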
