
Use a Spark RDD as a source of data in a REST API

There is a graph that is computed on Spark and stored in Cassandra.
There is also a REST API with an endpoint that returns a graph node together with its edges and the edges of those edges.
This second-degree subgraph may include up to 70,000 nodes.
We currently use Cassandra as the database, but extracting a lot of data by key from Cassandra takes considerable time and resources.
We tried TitanDB, Neo4j and OrientDB to improve performance, but Cassandra showed the best results.

Now there is another idea: persist the RDD (or maybe a GraphX object) in the API service and, on each API call, filter the necessary data out of the persisted RDD.
I expect this to be fast as long as the RDD fits in memory, but once it is cached to disk it will behave like a full scan (e.g. a full scan of a Parquet file). I also expect that we will face these issues (a minimal sketch of the pattern follows the list):

  • memory leaks in Spark;
  • updating the RDD (unpersisting the previous one, reading and persisting a new one) will require stopping the API;
  • concurrent use of the RDD will require manually managing CPU resources.
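
Here is a minimal sketch of what I have in mind, in Scala. Everything named here is hypothetical (GraphCache, the Adjacency shape, the partition count, the loader function); the two points it illustrates are keeping the RDD hash-partitioned by node id, so that lookup(key) runs a job over a single partition instead of scanning the whole dataset, and swapping snapshots through an AtomicReference, so a refresh does not require stopping the API:

    import java.util.concurrent.atomic.AtomicReference

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object GraphCache {
      // Hypothetical shape of the data: node id -> ids of its direct neighbours.
      type Adjacency = (Long, Seq[Long])

      // The snapshot currently being served. Swapped atomically on refresh,
      // so updating the data never requires stopping the API.
      private val current = new AtomicReference[RDD[Adjacency]]()

      // Load a fresh snapshot, hash-partition it by node id and cache it.
      // With a known partitioner, lookup(key) runs a job on the single
      // partition that owns the key instead of scanning the whole RDD.
      def refresh(spark: SparkSession, load: SparkSession => RDD[Adjacency]): Unit = {
        val fresh = load(spark)
          .partitionBy(new HashPartitioner(128)) // partition count is a guess
          .persist(StorageLevel.MEMORY_AND_DISK)
        fresh.count() // materialise the cache before exposing it
        val previous = current.getAndSet(fresh)
        if (previous != null) previous.unpersist(blocking = false)
      }

      // Called from the REST handler: direct neighbours of one node.
      // Assumes refresh() has been called at least once.
      def neighbours(nodeId: Long): Seq[Long] =
        current.get().lookup(nodeId).flatten
    }

Note that even with a partitioner, every lookup still launches a Spark job through the driver, so each read pays job-scheduling overhead and concurrent requests queue on the driver; that is essentially the third concern above.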

Does anybody have experience with this?

Spark is NOT a storage engine. Unless you are going to process a large amount of data each time, you should consider:

