Hello, I am working on a project where I have to pull data between 2018 and 2023. It's about 200 million records (not that many), but now I am confused about which of these two approaches to use to load the data.
I ran the query directly in the Oracle DB and it takes between 80 and 150 seconds.
Then I tried it like this with Spark:
```python
DXHS_FACTURACION_CONSUMOS = spark.read \
    .format("jdbc") \
    .option("url", url_SGC_ORACLE) \
    .option("dbtable", 'DXHS_FACTURACION_CONSUMOS') \
    .option("user", username) \
    .option("password", password) \
    .option("driver", driver_oracle) \
    .load() \
    .select('ID_SERVICIO', 'ID_MES', 'CS_FACTURADO', 'CS_ANULADO',
            'CS_REFACTURADO', 'IN_FACTURADO', 'CC_DIAS_FACTURADO',
            'FE_FACTURA_ULTIMA')

DXHS_FACTURACION_CONSUMOS.show()
```
It had been running for about 20 minutes without finishing, so then I tried:
```python
query2 = """
SELECT ID_SERVICIO, ID_MES, CS_FACTURADO, CS_ANULADO, CS_REFACTURADO,
       IN_FACTURADO, CC_DIAS_FACTURADO, FE_FACTURA_ULTIMA
FROM DXHS_FACTURACION_CONSUMOS
WHERE ID_MES >= 201801
"""

DXHS_FACTURACION_CONSUMOS = spark.read \
    .format("jdbc") \
    .option("url", url_SGC_ORACLE) \
    .option("query", query2) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", driver_oracle) \
    .load().alias('DXHS_FACTURACION_CONSUMOS')

DXHS_FACTURACION_CONSUMOS.show()
```
This finishes in about 90 seconds.
Does the first example load the whole table and only then start filtering, while the second example filters first in the database and loads only the required data into Spark? Or why is the difference so big?
Thank you for the knowledge.
Update:
Adding more context, as some of you asked for it.
I am using Spark 3.3.0 in local mode. I tried the parallel reading Kashyap mentioned, but it looks like it only works in cluster mode, and I would have to read the whole table.
Below is an example of the table I am working with. It has data from 2000 or earlier, but I only need 2018 onward.
I also need to join it with a similar table, likewise restricted to 2018 onward.
| ID_SERVICIO | ID_MES | CS_FACTURADO | CS_ANULADO | CS_REFACTURADO | IN_FACTURADO | CC_DIAS_FACTURADO | FE_FACTURA_ULTIMA | TI_TARIFA |
|---|---|---|---|---|---|---|---|---|
| 1 | 200001 | 324 | 0 | 0 | S | 31 | 2022-04-04 00:00:00 | c45 |
| 2 | 201801 | 425 | 20 | 0 | S | 31 | 2020-04-04 00:00:00 | g56 |
| 3 | 202212 | 645 | 0 | 56 | S | 28 | 2020-04-04 00:00:00 | f78 |
thanks!
`.option("dbtable", 'DXHS_FACTURACION_CONSUMOS')` translates to `SELECT * FROM DXHS_FACTURACION_CONSUMOS` on the DB, while `.option("query", query2)` executes `query2` on the DB, i.e. `SELECT <columns> ... WHERE ID_MES >= 201801`. Hence the first one is slower and the second one is much faster, mostly because of the WHERE clause.
You're not using anything specific to Spark here.
If you do want to read a large amount of data faster, use `partitionColumn` to make Spark run multiple SELECT queries in parallel. E.g. you might be able to use a year column as the `partitionColumn`, if you have one.
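A sketch of what such a parallel read could look like, reusing the connection variables from the question (`url_SGC_ORACLE`, `username`, `password`, `driver_oracle`). Note that `partitionColumn` can only be combined with `dbtable`, not with `query`, so the filter goes into a subquery alias; `ID_MES` is used as the partition column on the assumption that it is numeric, and the bounds and partition count are illustrative, not prescriptive:

```python
# Push the WHERE filter into a subquery alias, since "query" cannot be
# combined with partitionColumn.
pushdown_query = """(SELECT ID_SERVICIO, ID_MES, CS_FACTURADO, CS_ANULADO,
                            CS_REFACTURADO, IN_FACTURADO, CC_DIAS_FACTURADO,
                            FE_FACTURA_ULTIMA
                     FROM DXHS_FACTURACION_CONSUMOS
                     WHERE ID_MES >= 201801) t"""

df = (spark.read
      .format("jdbc")
      .option("url", url_SGC_ORACLE)
      .option("dbtable", pushdown_query)
      .option("user", username)
      .option("password", password)
      .option("driver", driver_oracle)
      .option("partitionColumn", "ID_MES")  # must be numeric, date or timestamp
      .option("lowerBound", "201801")       # illustrative bounds
      .option("upperBound", "202312")
      .option("numPartitions", "12")        # up to 12 SELECTs run in parallel
      .load())
```

The bounds only control how the range is split into partitions; rows outside them are still read (into the first/last partition), so the subquery's WHERE clause remains the actual filter.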
--- edit ---
> it looks like it only works in cluster mode with multiple workers
Yes and no. When you use `partitionColumn` in your code, Spark generates a plan that contains multiple tasks (let's say 100), one each to execute a query like:

```sql
select * from table where partitionColumn >    0 and partitionColumn <=  100
select * from table where partitionColumn >  100 and partitionColumn <=  200
...
select * from table where partitionColumn > 9900 and partitionColumn <= 10000
```
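The splitting above can be modeled in a few lines of plain Python. This is a simplified sketch for illustration only; Spark's real logic (in `JDBCRelation.columnPartition`) differs in details such as stride rounding, but the shape of the generated predicates is the same, including the `IS NULL` catch-all on the first partition:

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Simplified model of how Spark's JDBC source turns
    (partitionColumn, lowerBound, upperBound, numPartitions)
    into one WHERE clause per task."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        if i == 0:
            # first partition also picks up NULLs and anything below lowerBound
            preds.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last partition is open-ended above
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {hi}")
    return preds

for p in partition_predicates("ID_MES", 201801, 202312, 4):
    print(p)
# ID_MES < 201928 OR ID_MES IS NULL
# ID_MES >= 201928 AND ID_MES < 202055
# ID_MES >= 202055 AND ID_MES < 202182
# ID_MES >= 202182
```

Each predicate becomes the WHERE clause of one task's SELECT, which is why the number of concurrent queries equals `numPartitions` (capped by available task slots).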
When it comes to actually executing this plan, Spark runs these tasks in parallel on whatever task slots it has: executor cores in cluster mode, or threads of the single JVM in local mode (e.g. `local[8]`). So partitioned reads can still help in local mode.
Update: just realized you have a WHERE condition in your query, but it is not there in the former (dbtable) version. This means you're effectively doing a `SELECT *` there, without any filter condition.
As for columns: you're probably using a Spark version below 3 and expecting Spark to take care of selecting only the required columns. Spark 2.4 had predicate pushdown but lacked projection pushdown. Let me know if this is not the case; I can help you investigate.
Predicate pushdown refers to WHERE, filter, IN, LIKE etc. clauses, which affect the number of rows returned. It is basically row-based filtering.
As opposed to this, Spark 3 introduced projection pushdown, which affects the number of columns returned. This is column-based filtering.
Thus, in your code, with the query approach you're explicitly defining the query (and hence the columns) to be used for reading. However, when specifying just the table, Spark will read all the columns (i.e. no column pruning), even if only specific columns are used later. You can check this behavior using `.explain()`.
From Spark 3 onward, Spark takes care of this for you.
See: What is the difference between "predicate pushdown" and "projection pushdown"?