
Data locality in Spark Streaming

Original 2015-07-21 11:30:45 4 1 apache-spark/ real-time/ bigdata/ distributed-computing/ spark-streaming

Recently I've been running performance tests on Spark Streaming. I ran a receiver on one of the 6 slaves and submitted a simple Word Count application to the cluster (I know this configuration is not appropriate in practice; it is just a simple test). I analyzed the scheduling log and found that nearly 88% of tasks are scheduled to the node where the receiver runs, the locality level is always PROCESS_LOCAL, and the CPU utilization on that node is very high. Why doesn't Spark Streaming distribute the data across the cluster and make full use of it? I've read the official guide, but it does not explain this in detail, especially for Spark Streaming. When a task is assigned to a node whose CPU is busy, will Spark copy the stream data to another node with free CPU and start a new task there? If so, how can the case above be explained?

1 answer

When you run the stream receiver on just one of the 6 nodes, the received data is stored on that node, so all of it is processed there as well (that is data locality).

By default, the data is not distributed to the other nodes. If you need the input stream to be repartitioned (balanced across the cluster) before further processing, you can use

inputStream.repartition(<number of partitions>)

This redistributes each received batch of data into the specified number of partitions, which Spark can then schedule on different machines in the cluster before further processing.
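As a rough sketch of how this might look in a PySpark word count, the repartition call goes right after the input stream is created and before any transformation. The socket source on localhost:9999, the 1-second batch interval, the value of 12 partitions, and the helper names are all assumptions made for illustration; none of them come from the question.

```python
def split_words(line):
    """Split one line of text into words, dropping empty tokens."""
    return [word for word in line.split(" ") if word]


def build_word_counts(lines, num_partitions):
    """Repartition the received stream, then count words in each batch.

    `lines` is a DStream of strings. `num_partitions` is a tuning value
    you choose yourself (hypothetical here, not taken from the question).
    """
    # repartition() shuffles every batch into num_partitions partitions, so
    # downstream tasks are no longer tied to the receiver's executor.
    balanced = lines.repartition(num_partitions)
    return (balanced.flatMap(split_words)
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))


def main():
    # Imported here so the helpers above can be read and reused on their own.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="RepartitionedWordCount")
    ssc = StreamingContext(sc, 1)  # 1-second batches (assumed)
    lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
    build_word_counts(lines, num_partitions=12).pprint()  # 12 is an assumption
    ssc.start()
    ssc.awaitTermination()

# Call main() at the end of the file when submitting it with spark-submit.
```

The repartition step adds a shuffle to every batch, so whether it pays off depends on how heavy the processing after it is compared with the cost of moving the data.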

You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning

Related questions

Spark Streaming and Data Locality when dynamically loading files

spark + hadoop data locality

Data Locality in Spark on Kubernetes

Apache spark data locality algorithm

Spark and HDFS on Kuberenetes data locality

spark data locality on large cluster

Does Spark use data locality?

Data locality with Spark standalone and HDFS

Does spark on mesos support data locality?

Data Locality in Spark on Kubernetes colocated with HDFS pods





solution1 1 2015-07-21 11:49:21
