
Apache Spark Processing Capabilities & Eligibility

I am new to Apache Spark and wonder whether it is suitable for my specific scenario or not. In my case, I am crawling small datasets (as JSON files into MongoDB). Those files are all related to the same entity, but they may have different structures (a specific JSON document in a collection may include more or fewer key/value pairs than the others). What I am trying to do is run machine learning (classification / regression) algorithms on those data files and derive information from them.

Given this case, do you think Spark is a good fit for speeding things up by processing in parallel in a cluster environment? Or do you think I should look at some other alternatives?

Thank you.

Parallel processing is the way to go in today's big data world, and considering your case, Spark is definitely a good choice. Spark is an in-memory computation tool that operates on a driver-executor scheme. Memory is the most important factor to consider when sizing a Spark deployment. You can check out Apache Spark's documentation for details.
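Since memory sizing matters so much, here is a minimal, hypothetical `spark-submit` invocation illustrating where executor memory and parallelism are configured; the master URL, resource values, and script name are placeholders you would replace for your own cluster:

```shell
# Hypothetical submission of a PySpark job to a standalone cluster.
# Tune --executor-memory and --num-executors to your nodes' capacity.
spark-submit \
  --master spark://master-host:7077 \
  --executor-memory 4G \
  --num-executors 4 \
  --executor-cores 2 \
  crawl_ml_job.py
```

For quick local experiments you can instead use `--master "local[*]"`, which runs executors as threads on a single machine.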

Since your project is related to machine learning, Spark has a lot of libraries for it; see the mllib-guide.

MongoDB is supported too. You can check out the Databricks use case.

I hope this is helpful.

Yes, Apache Spark supports these kinds of use cases. You can read directly from your JSON files if you want, and MongoDB as a data source is also supported. But the most important reason to use Spark is that it supports machine learning algorithms directly on the datasets, and you get parallel processing, fault tolerance, lazy evaluation, and a lot more!

Referencing directly from their Machine Learning page:

Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

  • ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
  • Persistence: saving and loading algorithms, models, and Pipelines
  • Utilities: linear algebra, statistics, data handling, etc.

Check out their page on Machine Learning for more details: http://spark.apache.org/docs/latest/ml-guide.html

MongoDB as a data source: https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
Loading JSON files from a folder directly: How to load a directory of JSON files into Apache Spark in Python
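For the MongoDB side, a minimal read sketch might look like the following. This assumes the MongoDB Spark Connector (the v10.x line, where the format name is `mongodb`) is on the classpath, and the URI, database, and collection names are hypothetical placeholders; option names differ across connector versions, so check the docs for the version you use:

```python
from pyspark.sql import SparkSession

# Assumes a reachable MongoDB instance and the MongoDB Spark Connector,
# e.g. added via --packages org.mongodb.spark:mongo-spark-connector_2.12:10.2.1
spark = (
    SparkSession.builder
    .appName("mongo-demo")
    .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
    .getOrCreate()
)

df = (
    spark.read.format("mongodb")
    .option("database", "crawler")      # hypothetical database name
    .option("collection", "entities")   # hypothetical collection name
    .load()
)

# The connector infers a DataFrame schema from the documents, so the
# varying key/value pairs from the question surface as nullable columns.
df.printSchema()
</imports>
```

This is a configuration sketch rather than a runnable snippet; it requires a live MongoDB server and the connector package.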

Moreover, it has APIs in Python, R, Scala, and Java! Choose whichever you are comfortable with.

Notice: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.

 
© 2020-2024 STACKOOM.COM