
How do I use Datasets with BigInts?

Try as I may, I cannot create a Dataset of a case class with enough precision to handle a DecimalType(38,0) .

I've tried:

case class BigId(id: scala.math.BigInt)

This runs into an error in the ExpressionEncoder (see https://issues.apache.org/jira/browse/SPARK-20341).
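
Roughly, the failing attempt looks like this (a sketch, assuming a SparkSession named spark is in scope):

import spark.implicits._

// Using the BigId case class from above; it's the toDS() call that
// fails inside the ExpressionEncoder on the Spark version I'm running.
val ds = Seq(BigId(BigInt("12345678901234567890123456789012345678"))).toDS()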

I've tried:

case class BigId(id: java.math.BigDecimal)

But this runs into errors where the only possible precision is DecimalType(38,18). I've even created my own custom encoder, borrowing liberally from the Spark source code. The biggest change is that I default the schema for java.math.BigDecimal to DecimalType(38,0); I couldn't find any reason to change the serializer or deserializer. When I provide my custom Encoder to Dataset.as or Dataset.map, I get the following stack trace:

User class threw exception: org.apache.spark.sql.AnalysisException: Cannot up cast `id` from decimal(38,0) to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "java.math.BigDecimal", name: "id")
- root class: "BigId"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
org.apache.spark.sql.AnalysisException: Cannot up cast `id` from decimal(38,0) to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "java.math.BigDecimal", name: "id")
- root class: "BigId"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:1998)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2020)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2015)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:357)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:355)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:235)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:245)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:254)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:254)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:223)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2015)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2011)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.apply(Analyzer.scala:2011)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.apply(Analyzer.scala:1996)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolveAndBind(ExpressionEncoder.scala:244)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:210)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
    at org.apache.spark.sql.Dataset.as(Dataset.scala:359)

I can confirm that both my input DataFrame.schema and my encoder.schema have a precision of DecimalType(38,0). I've also removed any import spark.implicits._ statements, to confirm that the DataFrame methods are using my custom encoder.
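
For reference, this is roughly how I compared the schemas (a sketch: df stands in for my input DataFrame and bigIdEncoder for the custom encoder described above):

import org.apache.spark.sql.Encoders

// Spark's built-in mapping for java.math.BigDecimal is DecimalType(38,18),
// which is where the up-cast complaint comes from:
println(Encoders.product[BigId].schema)  // StructType(StructField(id,DecimalType(38,18),true))

// Both my input data and my custom encoder carry DecimalType(38,0):
df.printSchema()                         // id: decimal(38,0)
println(bigIdEncoder.schema)             // StructType(StructField(id,DecimalType(38,0),true))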

At this point, it seems that the easiest option left is to pass the id around as a String. This seems wasteful.

While I admire your gumption in defining a custom encoder, it is unnecessary. Your value is an ID--not something you intend to use as a number. In other words, you aren't going to use it for calculations. You are simply turning a String into a BigId for the sole purpose of perceived optimization.

As the legendary Donald Knuth once wrote: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

So solve efficiency problems when they actually happen. Right now, you have a solution looking for a problem--and not even a working solution at that, after spending a lot of time that would have been better spent on the quality of your analytics.

As for the efficiency of using String in general, rely on the Tungsten optimizations that the Spark team has worked very hard on under the hood, and keep your eye on the ball.
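
If it helps, here is a minimal sketch of the String-based approach (assuming your input DataFrame is df and its decimal(38,0) column is named id):

import org.apache.spark.sql.functions.col
import spark.implicits._

case class BigId(id: String)

// Cast the decimal column to a string and let the ordinary product encoder do the rest.
val ds = df.select(col("id").cast("string").as("id")).as[BigId]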
