
Use spark RDD as a source of data in a REST API

There is a graph that is computed on Spark and stored in Cassandra.
There is also a REST API with an endpoint that returns a graph node together with its edges and the edges of those edges.
This second-degree graph may include up to 70,000 nodes.
We currently use Cassandra as the database, but extracting a lot of data by key from Cassandra takes a lot of time and resources.
We tried TitanDB, Neo4j and OrientDB to improve performance, but Cassandra showed the best results.

Now there is another idea: persist the RDD (or maybe a GraphX object) in the API service and, on each API call, filter the necessary data from the persisted RDD.
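A minimal sketch of what this could look like, assuming a hypothetical `Edge(src, dst)` case class (the real schema and the Cassandra read are not specified in the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object GraphApiSketch {
  // Hypothetical edge representation; the real schema is not given in the question.
  case class Edge(src: Long, dst: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graph-api").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // In the real service this RDD would be read from Cassandra; a toy RDD stands in here.
    val edges = sc.parallelize(Seq(Edge(1L, 2L), Edge(2L, 3L), Edge(1L, 4L)))
      .persist(StorageLevel.MEMORY_ONLY)
    edges.count() // materialize the cache before serving requests

    // Per-request work: first-degree edges of `root`,
    // then the edges of the resulting neighbours (second degree).
    val root = 1L
    val first = edges.filter(e => e.src == root || e.dst == root)
    val neighbours = first.flatMap(e => Seq(e.src, e.dst)).filter(_ != root).collect().toSet
    val broadcastNeighbours = sc.broadcast(neighbours)
    val second = edges.filter(e =>
      broadcastNeighbours.value(e.src) || broadcastNeighbours.value(e.dst))

    first.union(second).distinct().collect().foreach(println)
    spark.stop()
  }
}
```

Note that each `filter` is still a full pass over the partitions of the cached RDD; persisting only keeps that scan in memory, which is exactly the full-scan behaviour described below once the data spills to disk.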
I guess that it will work fast while the RDD fits in memory, but if it is cached to disk it will work like a full scan (e.g. a full scan of a Parquet file). I also expect that we will face these issues:

  • memory leaks in Spark;
  • updating this RDD (unpersisting the previous one, then reading and persisting a new one) will require stopping the API (a possible workaround is sketched after this list);
  • concurrent use of this RDD will require manually managing CPU resources.
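
On the update point, one way to avoid stopping the API might be to build and materialize the new RDD first, publish it atomically, and only then drop the old cache. A minimal sketch, assuming a `loadEdges` function that stands in for the actual Cassandra read:

```scala
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch of a hot-swappable cached RDD; `loadEdges` is a placeholder
// for however the data is actually read (e.g. from Cassandra).
class RefreshableRdd[T](sc: SparkContext, loadEdges: SparkContext => RDD[T]) {
  private val current = new AtomicReference[RDD[T]](cacheFresh())

  private def cacheFresh(): RDD[T] = {
    val rdd = loadEdges(sc).persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count() // force materialization before the RDD is published
    rdd
  }

  /** The RDD that API request handlers should use. */
  def get: RDD[T] = current.get()

  /** Load and cache the new data, publish it, then drop the old cache. */
  def refresh(): Unit = {
    val fresh = cacheFresh()
    val old = current.getAndSet(fresh)
    old.unpersist(blocking = false) // in-flight requests on `old` can finish
  }
}
```

In-flight requests that already hold a reference to the old RDD can still complete: `unpersist(blocking = false)` only evicts the cached blocks, and Spark can recompute them from lineage if an action still needs them.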

Does anybody have such experience?

Spark is NOT a storage engine. Unless you will process a big amount of data each time, you should consider:
