How to load data into a Product case class using a DataFrame in Spark
I have a text file containing data like this:
productId|price|saleEvent|rivalName|fetchTS
123|78.73|Special|VistaCart.com|2017-05-11 15:39:30
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43
123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29
678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06
678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01
I have to find the minimum price of each product across websites; e.g., my output should look like this:
productId|price|saleEvent|rivalName|fetchTS
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01
I am trying this:
case class Product(productId: String, price: Double, saleEvent: String, rivalName: String, fetchTS: String)

val cDF = spark.read.text("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
val (header, values) = cDF.collect.splitAt(1)
values.foreach(x => Product(x(0).toString, x(1).toString.toDouble,
  x(2).toString, x(3).toString, x(4).toString))
I get an exception while running the last line:
java.lang.ArrayIndexOutOfBoundsException: 1
  at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
  at org.apache.spark.sql.Row$class.apply(Row.scala:163)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:166)
  at $anonfun$1.apply(<console>:28)
  at $anonfun$1.apply(<console>:28)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  ... 49 elided
Printing the value of values:
scala> values
res2: Array[org.apache.spark.sql.Row] =
Array([123|78.73|Special|VistaCart.com|2017-05-11 15:39:30 ],
 [123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 ],
 [123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29 ],
 [678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06 ],
 [678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22 ],
 [678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 ],
 [777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 ])
scala>
I understand that I need to split("|").
scala> val xy = values.foreach(x => x.toString.split("|").toSeq)
xy: Unit = ()
So after splitting it gives me Unit, i.e. void, and I am unable to load the values into the Product case class. How can I load this DataFrame into the Product case class? I don't want to use Dataset for now, although Dataset is type-safe.
I'm using Spark 2.3 and Scala 2.11.
The issue is that split takes a regex, which means you need to use "\\|" instead of a single "|". Also, the foreach needs to be changed to a map to actually return a value, i.e.:
val xy = values.map(x => x.toString.split("\\|"))
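To illustrate the regex behaviour outside of Spark, here is a minimal sketch in plain Scala, reusing the Product case class and one sample line from the question:

```scala
// The Product case class from the question
case class Product(productId: String, price: Double, saleEvent: String, rivalName: String, fetchTS: String)

val line = "123|78.73|Special|VistaCart.com|2017-05-11 15:39:30"

// "|" is regex alternation between two empty patterns, so it matches at
// every position and splits the string into single characters:
line.split("|").take(3)    // Array("1", "2", "3")

// Escaping the pipe makes it a literal separator, yielding the 5 fields:
val f = line.split("\\|")

// Now the fields line up with the case class constructor:
val p = Product(f(0), f(1).toDouble, f(2), f(3), f(4))
```

The same map over values would then produce an Array[Product] rather than Unit.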
However, a better approach would be to read the data as a CSV file with | as the separator. This way you do not need to treat the header specially, and by inferring the column types there is no need to do any conversions (here I changed fetchTS to a Timestamp):
import java.sql.Timestamp

case class Product(productId: String, price: Double, saleEvent: String, rivalName: String, fetchTS: Timestamp)

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("sep", "|")
  .csv("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
  .as[Product]
The final line converts the DataFrame to use the Product case class. If you want to use it as an RDD instead, simply add .rdd at the end.
After this is done, use groupBy and agg to get the final results.
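For completeness, one way to sketch that last step (assuming the df defined above and the column names from the file header) is to compute the minimum price per productId and then join back to recover the full winning rows:

```scala
import org.apache.spark.sql.functions.min

// Minimum price per product
val minPrices = df.groupBy("productId").agg(min("price").as("price"))

// Join back on (productId, price) to keep the full row for each minimum
val cheapest = df.join(minPrices, Seq("productId", "price"))
cheapest.show(false)
```

Note that joining on the minimum can return more than one row per product if two rivals tie on price; a window function with row_number would break such ties if needed.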