
Apache Spark Processing Capabilities & Suitability

I am new to Apache Spark and wonder whether it is suitable for my specific scenario. In my case, I am crawling small datasets and storing them as JSON files in MongoDB. The files all relate to the same entity, but they can have different structures (a given JSON document in the collection may contain more or fewer key/value pairs than the others). What I am trying to do is run machine learning (classification/regression) algorithms on these data files and derive information from them.

Considering this case, do you think Spark is a good fit for speeding things up by processing the data in parallel in a cluster environment? Or do you think I should look at other alternatives?

Thank you.

Parallel processing is the way to go in today's big data world, and considering your case, Spark is definitely a good choice. Spark is an in-memory computation engine that operates on a driver-executor scheme, so memory is the most important factor to consider when choosing Spark. You can check out the Apache Spark documentation.
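As a rough illustration of that driver-executor model, here is a minimal sketch assuming a local PySpark installation; the application name and the "local[*]" master are placeholders, and on a real cluster you would point the master at YARN, Kubernetes, or a standalone cluster manager instead:

    from pyspark.sql import SparkSession

    # The driver process is created here; "local[*]" runs executor threads on this machine.
    spark = SparkSession.builder \
        .appName("demo") \
        .master("local[*]") \
        .getOrCreate()

    # The driver only builds the plan; executors hold the partitions in memory once cached.
    df = spark.range(0, 1_000_000)
    df.cache()
    print(df.count())  # the first action materializes the data into executor memory

    spark.stop()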

Since your project is related to machine learning, Spark also ships with a rich set of machine learning libraries; see the mllib-guide.

MongoDB is supported too; you can check out the Databricks use case.

I hope this is helpful.

Yes, Apache Spark supports these kinds of use cases. You can read your JSON files directly if you want, and MongoDB is also supported as a data source. Most importantly, Spark lets you run machine learning algorithms directly on the datasets, and you get parallel processing, fault tolerance, lazy evaluation, and a lot more!
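For example, here is a minimal sketch of loading the crawled files straight from a folder, assuming PySpark and a made-up path (crawled/*.json). Schema inference in spark.read.json takes the union of all keys it sees, so documents that lack a key simply get null in that column, which suits your varying structures:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-ingest").getOrCreate()

    # Schema inference merges the keys of all files; absent keys become null columns.
    df = spark.read.json("crawled/*.json")
    df.printSchema()
    df.show(5)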

Referencing directly from their Machine Learning Page -

Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

  • ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
  • Persistence: saving and loading algorithms, models, and Pipelines
  • Utilities: linear algebra, statistics, data handling, etc.

Check out their page on Machine Learning for more details - http://spark.apache.org/docs/latest/ml-guide.html
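To give a flavour of what that looks like in code, here is a sketch of a classification pipeline, continuing from the DataFrame df loaded above; the column names ("age", "clicks", "label") are invented placeholders for whatever fields your crawled entity actually has:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # MLlib expects the features as a single vector column.
    assembler = VectorAssembler(inputCols=["age", "clicks"], outputCol="features",
                                handleInvalid="keep")  # keep rows that miss some keys
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)            # training runs in parallel on the executors
    predictions = model.transform(test)
    predictions.select("label", "prediction").show(5)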

MongoDB as a data source - https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html

Loading JSON files from a folder directly - How to load directory of JSON files into Apache Spark in Python
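If you prefer to read the collection from MongoDB rather than from raw files, a rough sketch follows. It assumes the MongoDB Spark Connector is on the classpath, and the exact format name and option keys vary by connector version (older releases use the "mongo" format with spark.mongodb.input.uri, while the 10.x series uses "mongodb" with spark.mongodb.read.connection.uri), so check the docs for the version you install; the connection string below is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("mongo-ingest") \
        .config("spark.mongodb.read.connection.uri",
                "mongodb://localhost:27017/crawl.entities") \
        .getOrCreate()

    # One DataFrame, partitioned across the executors by the connector.
    df = spark.read.format("mongodb").load()
    df.printSchema()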

Moreover, it has APIs in Python, R, Scala, and Java! Choose whichever you are comfortable with.
