
spark spark.read().load().select().filter() vs spark.read().option(query) BIG time difference

Hello, I am working on a project where I have to pull data between 2018 and 2023. It's about 200 million records (not that many), but now I am confused by these two approaches to loading the data.

I ran the query directly against the Oracle DB and it takes between 80 and 150 seconds.

Then I tried it like this with spark:

    DXHS_FACTURACION_CONSUMOS = spark.read \
        .format("jdbc") \
        .option("url", url_SGC_ORACLE) \
        .option("dbtable", 'DXHS_FACTURACION_CONSUMOS') \
        .option("user", username) \
        .option("password", password) \
        .option("driver", driver_oracle) \
        .load().select('ID_SERVICIO','ID_MES','CS_FACTURADO','CS_ANULADO','CS_REFACTURADO','IN_FACTURADO','CC_DIAS_FACTURADO','FE_FACTURA_ULTIMA')

    DXHS_FACTURACION_CONSUMOS.show()

It ran for about 20 minutes without finishing, so then I tried:

query2 = """
  SELECT ID_SERVICIO, ID_MES, CS_FACTURADO, CS_ANULADO, CS_REFACTURADO, IN_FACTURADO, CC_DIAS_FACTURADO, FE_FACTURA_ULTIMA FROM DXHS_FACTURACION_CONSUMOS WHERE ID_MES >=201801
"""

DXHS_FACTURACION_CONSUMOS = spark.read \
    .format("jdbc") \
    .option("url", url_SGC_ORACLE) \
    .option("query", query2) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", driver_oracle) \
    .load().alias('DXHS_FACTURACION_CONSUMOS')



DXHS_FACTURACION_CONSUMOS.show()

It takes about 90 seconds to finish.

Does the first example load the whole table and then start filtering, while the second filters first in the database and loads only the required data into Spark? Why is the difference so big?

Thank you for the knowledge.

Update:

I am giving more context as some of you guys wanted it.

I am using Spark 3.3.0 in local mode. I tried the parallel reading that Kashyap mentioned, but it looks like it only works in cluster mode, and I would have to read the whole table.

This is an example of the table I am working with. I have data from 2000 or earlier, but I just need 2018 and onward.

I need to do a join with a similar table but with data from 2018 and onward.

    ID_SERVICIO  ID_MES  CS_FACTURADO  CS_ANULADO  CS_REFACTURADO  IN_FACTURADO  CC_DIAS_FACTURADO  FE_FACTURA_ULTIMA    TI_TARIFA
    1            200001  324           0           0               S             31                 2022-04-04 00:00:00  c45
    2            201801  425           20          0               S             31                 2020-04-04 00:00:00  g56
    3            202212  645           0           56              S             28                 2020-04-04 00:00:00  f78
  • It does not make sense to me that using select() and filter() produces a different execution plan, but it looks like it does.
  • As the tables are so big and I can't read in parallel (I have only one worker), I cached both tables and then started working, but I do not know if it is better to cache after the first join is done. (I will test it.)

thanks!

  • .option("dbtable", 'DXHS_FACTURACION_CONSUMOS') translates to SELECT * FROM DXHS_FACTURACION_CONSUMOS.
  • .option("query", query2) executes query2 on the DB: SELECT <columns> ... WHERE ID_MES >= 201801.

Hence the first one is slower and the second one is much faster, mostly because of the WHERE clause.
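As a rough sketch (an illustration, not Spark's actual JDBC internals), the SQL each option effectively ships to Oracle can be modeled like this, using the table and filter from the question:

```python
# Rough model of the SQL each read option effectively sends to Oracle.
# This is an illustration, not Spark's actual JDBC source code.

def sql_for_dbtable(table: str) -> str:
    # With only dbtable and no pushed-down filter, the database has to
    # produce every row of the table.
    return f"SELECT * FROM {table}"

def sql_for_query(query: str) -> str:
    # The query option ships your SQL text as-is, so the WHERE clause
    # filters rows inside the database before they reach Spark.
    return query.strip()

full_scan = sql_for_dbtable("DXHS_FACTURACION_CONSUMOS")
filtered = sql_for_query(
    "SELECT ID_SERVICIO, ID_MES FROM DXHS_FACTURACION_CONSUMOS "
    "WHERE ID_MES >= 201801"
)
print(full_scan)   # every row back to 2000 leaves the DB
print(filtered)    # only rows from 2018 onward leave the DB
```

If you prefer the dbtable option, a middle ground is wrapping the filtered query as a derived table, e.g. .option("dbtable", "(SELECT ... WHERE ID_MES >= 201801) t") — Spark treats dbtable as a subquery, so the filter still runs in Oracle.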

You're not using anything specific to Spark here.

If you do want to read a large amount of data faster, use partitionColumn to make Spark run multiple SELECT queries in parallel. E.g., you might be able to use the year as the partitionColumn if you have it.
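For example, assuming ID_MES can serve as the partition column (the bounds below are illustrative, and the connection details are the variables from the question), the read might look like this:

```python
# Hypothetical parallel-read options for the table from the question.
# partitionColumn/lowerBound/upperBound/numPartitions are the standard
# Spark JDBC options; the bounds here are illustrative, not measured.
parallel_options = {
    "dbtable": "DXHS_FACTURACION_CONSUMOS",
    "partitionColumn": "ID_MES",   # must be numeric, date, or timestamp
    "lowerBound": "201801",        # controls the stride only, not filtering
    "upperBound": "202312",
    "numPartitions": "12",         # at most 12 concurrent SELECTs
}

# In a real session (url_SGC_ORACLE, username, password, driver_oracle
# as defined in the question):
# df = (spark.read.format("jdbc")
#       .option("url", url_SGC_ORACLE)
#       .option("user", username)
#       .option("password", password)
#       .option("driver", driver_oracle)
#       .options(**parallel_options)
#       .load())
print(parallel_options["partitionColumn"])
```

Note that lowerBound/upperBound do not filter rows; rows outside the bounds still land in the first or last partition.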

--- edit ---

it looks like it only works in cluster mode with multiple workers

Yes and no. When you use partitionColumn in your code, Spark generates a plan that contains multiple tasks (let's say 100), each executing a query like:

  • select * from table where partitionColumn > 0 and partitionColumn <= 100
  • select * from table where partitionColumn > 100 and partitionColumn <= 200
  • ...
  • select * from table where partitionColumn > 9900 and partitionColumn <= 10000
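The bound arithmetic behind those generated queries can be sketched as a small helper (a simplification of what Spark's JDBC source does, not its exact code; the edge partitions are left open-ended so rows outside the bounds, and NULLs, are still read):

```python
def partition_predicates(column: str, lower: int, upper: int, n: int) -> list[str]:
    # Simplified sketch of how Spark's JDBC source splits [lower, upper)
    # into n WHERE clauses, one per task.
    stride = (upper - lower) // n
    preds = []
    for i in range(n):
        lo = lower + i * stride
        hi = lo + stride
        if i == 0:
            # first partition catches everything below, plus NULLs
            preds.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == n - 1:
            # last partition catches everything above
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {hi}")
    return preds

for p in partition_predicates("partitionColumn", 0, 10000, 4):
    print(p)
```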

When it comes to actually executing this plan:

  • if the HW you're running on has multiple executors (nodes), then the driver will execute these tasks in parallel.
  • if the HW you're running on has a single executor (node), then the driver will execute these tasks one at a time.

Update: I just realized you have a WHERE condition in the query version, but not in the former (dbtable) version. This means you're effectively doing SELECT * there, without any filter condition.

As for columns,

You're probably using a Spark version below 3 and expecting Spark to take care of selecting only the required columns. Spark 2.4 had predicate pushdown but lacked projection pushdown. Let me know if this is not the case; I can help you investigate.

Predicate pushdown refers to WHERE, filter, IN, LIKE, etc. clauses, which affect the number of rows returned. It is basically row-based filtering.

In contrast, Spark 3 introduced projection pushdown, which affects the number of columns returned; this is column-based filtering.

Thus, in your code, with the query approach you're explicitly defining the query (and hence the columns) used for reading. However, when specifying just the table, Spark will read all the columns (i.e., no column pruning) even if only specific columns are used later. You can check this behavior using .explain().
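One way to check is to capture the plan text that .explain() prints and look for the scan node's PushedFilters entry. The sample plan line below is made up to show the shape of the output, not taken from a real run:

```python
# Illustrative check for filter pushdown in a physical plan dump.
# In a real session, plan_text would be the output printed by
# df.explain(mode="formatted"); the sample below only mimics its shape.

def pushed_filters(plan_text: str) -> str:
    # Return the PushedFilters line from an explain() dump, if present.
    for line in plan_text.splitlines():
        if "PushedFilters" in line:
            return line.strip()
    return ""

sample_plan = """
(1) Scan JDBCRelation(DXHS_FACTURACION_CONSUMOS)
ReadSchema: struct<ID_SERVICIO:decimal(38,0),ID_MES:decimal(38,0)>
PushedFilters: [*IsNotNull(ID_MES), *GreaterThanOrEqual(ID_MES,201801)]
"""
print(pushed_filters(sample_plan))
```

An empty PushedFilters list (or the filter appearing in a separate Filter node above the scan) means the condition is being applied in Spark rather than in the database.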

From Spark 3 onward, Spark takes care of this for you.

See: What is the difference between "predicate pushdown" and "projection pushdown"?
