
Is MySQL more efficient than Apache Spark in query optimization and overall efficiency?

I find that Apache Spark is much slower than a MySQL server for the same query on the same table, with the query run against a Spark DataFrame.

So where would Spark be more efficient than MySQL?

Note: tried on a table with 1 million rows and 10 columns, all of type text.

The size of the table in JSON is about 10 GB.

Using a standalone PySpark notebook on a 16-core Xeon with 64 GB RAM, with MySQL running on the same server.

In general, I would like guidelines on when to use Spark vs. a SQL server, in terms of the size of the target data, to get really snappy results from analytic queries.
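For reference, here is a minimal sketch of the kind of side-by-side comparison described above. The connection details, file path, table name, and query are all hypothetical stand-ins, not taken from the original post:

```python
# Hypothetical benchmark: the same aggregate query run against MySQL
# directly and against a Spark DataFrame loaded from a 10 GB JSON dump.
import time
import pymysql
from pyspark.sql import SparkSession

# MySQL side: query the table on the server (credentials are placeholders).
conn = pymysql.connect(host="localhost", user="root",
                       password="secret", database="mydb")
start = time.time()
with conn.cursor() as cur:
    cur.execute("SELECT col1, COUNT(*) FROM mytable GROUP BY col1")
    cur.fetchall()
print("MySQL:", time.time() - start, "s")

# Spark side: load the JSON into a DataFrame and run the equivalent
# query through Spark SQL on all 16 local cores.
spark = SparkSession.builder.master("local[16]").getOrCreate()
df = spark.read.json("/data/mytable.json")   # hypothetical path
df.createOrReplaceTempView("mytable")
start = time.time()
spark.sql("SELECT col1, COUNT(*) FROM mytable GROUP BY col1").collect()
print("Spark:", time.time() - start, "s")
```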

OK, so I'm going to try to help here even though it's still very difficult to answer this without knowing more. Assuming there is no contention for resources, there are a number of things going on here. If you're running on YARN and your JSON is stored in HDFS, it is likely split into many blocks, and those blocks are then processed in different partitions. Since JSON doesn't split very well, you'd lose a lot of the parallel capability. Also, Spark isn't really meant for super-low-latency queries the way a tuned RDBMS is. Where you benefit from Spark is heavy data processing on large amounts of data (TB or PB). If you are looking for low-latency queries, you should use Impala or Hive with Tez. You should also consider changing your file format to Avro, Parquet, or ORC.
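A minimal PySpark sketch of that last suggestion: convert the JSON table to Parquet once, then run all subsequent queries against the Parquet copy. The file paths, view name, and sample query are assumptions for illustration:

```python
# One-time conversion from JSON to Parquet. JSON must be re-parsed row
# by row on every read, while Parquet is columnar, compressed, and
# splittable, so repeated analytic queries get much cheaper.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

df = spark.read.json("/data/mytable.json")                 # hypothetical path
df.write.mode("overwrite").parquet("/data/mytable.parquet")

# Subsequent queries read only the columns they actually touch.
parquet_df = spark.read.parquet("/data/mytable.parquet")
parquet_df.createOrReplaceTempView("mytable")
spark.sql("SELECT col1, COUNT(*) FROM mytable GROUP BY col1").show()
```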
