Scala: how to parameterize a case class, and pass the case class variable to [T <: Product: TypeTag]
// class definition of RsGoods schema
case class RsGoods(add_time: Int)
// my operation
originRDD.toDF[Schemas.RsGoods]()
// and the function definition
def toDF[T <: Product: TypeTag](): DataFrame = mongoSpark.toDF[T]()
Now I have defined too many schemas (RsGoods1, RsGoods2, RsGoods3), and more will be added in the future.
So the question is: how can I pass a case class as a variable to structure the code?
Attached sbt dependencies:
"org.apache.spark" % "spark-core_2.11" % "2.3.0",
"org.apache.spark" %% "spark-sql" % "2.3.0",
"org.mongodb.spark" %% "mongo-spark-connector" % "2.3.1",
Attached key code snippet:
var originRDD = MongoSpark.load(sc, readConfig)
val df = table match {
case "rs_goods_multi" => originRDD.toDF[Schemas.RsGoodsMulti]()
case "rs_goods" => originRDD.toDF[Schemas.RsGoods]()
case "ma_item_price" => originRDD.toDF[Schemas.MaItemPrice]()
case "ma_siteuid" => originRDD.toDF[Schemas.MaSiteuid]()
case "pi_attribute" => originRDD.toDF[Schemas.PiAttribute]()
case "pi_attribute_name" => originRDD.toDF[Schemas.PiAttributeName]()
case "pi_attribute_value" => originRDD.toDF[Schemas.PiAttributeValue]()
case "pi_attribute_value_name" => originRDD.toDF[Schemas.PiAttributeValueName]()
}
From what I have understood about your requirement, I think the following should be a decent starting point.
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

def readDataset[A: Encoder](
spark: SparkSession,
mongoUrl: String,
collectionName: String,
clazz: Class[A]
): Dataset[A] = {
val config = ReadConfig(
Map("uri" -> s"$mongoUrl.$collectionName")
)
val df = MongoSpark.load(spark, config)
val fieldNames = clazz.getDeclaredFields.map(f => f.getName).dropRight(1).toList
val dfWithMatchingFieldNames = df.toDF(fieldNames: _*)
dfWithMatchingFieldNames.as[A]
}
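As a quick sanity check (plain Scala, no Spark required), you can inspect which field names Java reflection reports for one of your case classes before relying on the `dropRight(1)` above. The `RsGoods` definition below is a local stand-in for the question's `Schemas.RsGoods`, with an extra illustrative field; note that `Class.getDeclaredFields` makes no ordering guarantee, though for a simple top-level case class it typically returns exactly the constructor parameters:

```scala
// Standalone check of what getDeclaredFields reports for a case class.
// RsGoods here is a local stand-in for the question's Schemas.RsGoods.
case class RsGoods(add_time: Int, name: String)

object FieldNameCheck extends App {
  val fieldNames = classOf[RsGoods].getDeclaredFields.map(_.getName).toList
  // For a simple top-level case class this contains exactly the
  // constructor parameters: "add_time" and "name".
  println(fieldNames)
}
```

If this prints extra synthetic fields for your classes, keep the `dropRight(1)`; if it prints exactly the constructor fields, drop it.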
You can use it like this:
case class RsGoods(add_time: Int)
val spark: SparkSession = ...
import spark.implicits._
val rsGoodsDS = readDataset[RsGoods](
spark,
"mongodb://example.com/database",
"rs_goods",
classOf[RsGoods]
)
Also, the following two lines,
val fieldNames = clazz.getDeclaredFields.map(f => f.getName).dropRight(1).toList
val dfWithMatchingFieldNames = df.toDF(fieldNames: _*)
are only required because normally Spark reads DataFrames with column names like value1, value2, and so on. So we want to change the column names to match what we have in our case class.
I am not sure what these "default" column names will be because MongoSpark is involved.
You should first check the column names in the df created as follows:
val config = ReadConfig(
Map("uri" -> s"$mongoUrl.$collectionName")
)
val df = MongoSpark.load(spark, config)
If MongoSpark fixes the problem of these "default" column names and picks the column names from your collection, then those two lines will not be required, and your method becomes just this:
def readDataset[A: Encoder](
spark: SparkSession,
mongoUrl: String,
collectionName: String
): Dataset[A] = {
val config = ReadConfig(
Map("uri" -> s"$mongoUrl.$collectionName")
)
val df = MongoSpark.load(spark, config)
df.as[A]
}
And:
val rsGoodsDS = readDataset[RsGoods](
spark,
"mongodb://example.com/database",
"rs_goods"
)
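Note that the type parameter still has to be known at compile time, so a match on the table name remains at the call site; it just collapses to one short arm per collection. A sketch, reusing the question's `Schemas` object and assuming a `mongoUrl` connection string is in scope:

```scala
// Sketch: per-collection dispatch using the generic readDataset above.
// `spark`, `mongoUrl`, and `table` are assumed to be in scope.
import spark.implicits._

val ds = table match {
  case "rs_goods"      => readDataset[Schemas.RsGoods](spark, mongoUrl, "rs_goods")
  case "ma_item_price" => readDataset[Schemas.MaItemPrice](spark, mongoUrl, "ma_item_price")
  // ...one arm per collection; a type argument cannot be a runtime variable
}
```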