Spark Dataset API giving a different result compared to DataFrame

I am using Spark 2.1 and have a Hive table in ORC format; the following is the schema.

col_name    data_type
tuid        string
puid        string
ts          string
dt          string
source      string
peer        string
# Partition Information 
# col_name  data_type
dt          string
source      string
peer        string

# Detailed Table Information    
Database:           test
Owner:              test
Create Time:        Tue Nov 22 15:25:53 GMT 2016
Last Access Time:   Thu Jan 01 00:00:00 GMT 1970
Location:           hdfs://apps/hive/warehouse/nis.db/dmp_puid_tuid
Table Type:         MANAGED
Table Parameters:   
  transient_lastDdlTime 1479828353
  SORTBUCKETCOLSPREFIX  TRUE

# Storage Information   
SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Storage Desc Parameters:    
  serialization.format  1

When I apply a filter on this table using a partition column, it works fine and only reads the specific partitions:

val puid = spark.read.table("nis.dmp_puid_tuid")
    .as(Encoders.bean(classOf[DmpPuidTuid]))
    .filter("""peer = "AggregateKnowledge" and dt = "20170403"""")

This is the physical plan for that query:

== Physical Plan ==
HiveTableScan [tuid#1025, puid#1026, ts#1027, dt#1022, source#1023, peer#1024], MetastoreRelation nis, dmp_puid_tuid, [isnotnull(peer#1024), isnotnull(dt#1022), 
(peer#1024 = AggregateKnowledge), (dt#1022 = 20170403)]

But when I use the code below, it reads the entire table into Spark:

val puid = spark.read.table("nis.dmp_puid_tuid")
    .as(Encoders.bean(classOf[DmpPuidTuid]))
    .filter(tp => tp.getPeer().equals("AggregateKnowledge") && Integer.valueOf(tp.getDt()) >= 20170403)

Physical plan for the above Dataset:

== Physical Plan ==
*Filter <function1>.apply
+- HiveTableScan [tuid#1058, puid#1059, ts#1060, dt#1055, source#1056, peer#1057], MetastoreRelation nis, dmp_puid_tuid

Note: DmpPuidTuid is a Java bean class.
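
The question does not show the bean itself, only that it has getters such as getPeer() and getDt(). A hypothetical Scala equivalent, using @BeanProperty to generate the getters/setters and no-arg constructor that Encoders.bean expects, and assuming one String property per table column, might look like this:

import scala.beans.BeanProperty

// Hypothetical sketch only; the real DmpPuidTuid is a Java bean
// whose exact fields are not shown in the question.
class DmpPuidTuid {
  @BeanProperty var tuid: String = _
  @BeanProperty var puid: String = _
  @BeanProperty var ts: String = _
  @BeanProperty var dt: String = _
  @BeanProperty var source: String = _
  @BeanProperty var peer: String = _
}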

When you pass a Scala function to filter, you prevent the Spark optimizer from seeing which columns of the Dataset are actually used, because the optimizer does not try to look inside the compiled code of the function. That is why the second plan scans the whole table and applies <function1>.apply afterwards. If you instead pass a column expression, such as col("peer") === "AggregateKnowledge" && col("dt").cast(IntegerType) >= 20170403, the optimizer can see which columns are required and adjust the plan accordingly; here that means pushing the partition predicates down into the HiveTableScan.
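
Put together, a minimal sketch of the fix, reusing the table name and bean class from the question:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// The Column expression is visible to Catalyst, so the predicates on
// the partition columns peer and dt can be pushed into the table scan
// instead of being evaluated row by row after a full read.
val puid = spark.read.table("nis.dmp_puid_tuid")
    .as(Encoders.bean(classOf[DmpPuidTuid]))
    .filter(col("peer") === "AggregateKnowledge" && col("dt").cast(IntegerType) >= 20170403)

The resulting physical plan should again show the predicates inside the HiveTableScan, as in the first query.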
