
How to ensure data locality in Spark data source v2?

Original 2019-03-06 08:25:07 0 1 apache-spark/ apache-spark-sql/ datasource

I am implementing a Spark data source (v2) and have not found a way to ensure data locality.

In data source v1, the getPreferredLocations method can be implemented. What is the equivalent in data source v2?

1 answer

In Spark data source v2, you should use SupportsReportPartitioning instead.
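Note that SupportsReportPartitioning describes how rows are grouped across partitions rather than which hosts hold them. As far as I know, the v2 read API also lets each partition object report its preferred hosts through a preferredLocations() method (on InputPartition in Spark 2.4, on DataReaderFactory in 2.3), which looks like the closer counterpart to v1's getPreferredLocations. The sketch below does not use Spark's classes: LocalityAwarePartition and BlockPartition are hypothetical stand-ins that only show the shape of such a hook, and the host names are made up.

```java
import java.util.Arrays;

// Stand-in for the per-partition read interface. This is NOT Spark's type;
// it only mirrors the shape of a preferredLocations() hook.
interface LocalityAwarePartition {
    // Host names where this partition's data lives; empty means no preference.
    default String[] preferredLocations() {
        return new String[0];
    }
}

// Hypothetical partition that remembers which hosts store its block.
final class BlockPartition implements LocalityAwarePartition {
    private final String[] hosts;

    BlockPartition(String... hosts) {
        this.hosts = hosts.clone();
    }

    @Override
    public String[] preferredLocations() {
        return hosts.clone(); // defensive copy so callers cannot mutate state
    }
}

final class LocalitySketch {
    public static void main(String[] args) {
        LocalityAwarePartition p = new BlockPartition("node-1", "node-2");
        System.out.println(Arrays.toString(p.preferredLocations()));
    }
}
```

In a real source the hosts would come from the storage system's block or replica metadata, and the scheduler treats them as a hint, not a guarantee.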

Some limitations are discussed in the issue SPARK-15689 - Data source API v2:

So SupportsReportPartitioning is not powerful enough to support custom hash functions yet. There are two major operators that may introduce a shuffle: join and aggregate. Aggregate only needs the data to be clustered, and doesn't care how, so data source v2 can support it if your implementation handles ClusteredDistribution. Join needs the data of the two children to be clustered by the Spark shuffle hash function, which data source v2 does not currently support.
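To make the aggregate case concrete, here is a minimal sketch of the kind of check a source might run when asked whether its layout satisfies a clustering requirement. It assumes the source knows the columns it partitions by; ClusteringSketch and satisfiesClustering are hypothetical names, not part of Spark's API, and the column names are made up. The rule it encodes is that if the partition keys are a non-empty subset of the required clustering columns, then rows that agree on all required columns must land in the same partition.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Stand-in for the check a source could run when handling a clustering
// requirement. None of these names come from Spark; they illustrate the rule.
final class ClusteringSketch {

    // True when rows that agree on every required column are guaranteed to
    // sit in the same partition, i.e. the source's partition keys are a
    // non-empty subset of the required clustering columns.
    static boolean satisfiesClustering(String[] partitionKeys, String[] requiredColumns) {
        if (partitionKeys.length == 0) {
            return false; // no known layout, so nothing can be promised
        }
        Set<String> required = new HashSet<>(Arrays.asList(requiredColumns));
        return required.containsAll(Arrays.asList(partitionKeys));
    }

    public static void main(String[] args) {
        String[] keys = {"user_id"};
        // Grouping by (user_id, day): partitioning by user_id is enough.
        System.out.println(satisfiesClustering(keys, new String[] {"user_id", "day"}));
        // Grouping by day only: rows for one day may span partitions.
        System.out.println(satisfiesClustering(keys, new String[] {"day"}));
    }
}
```

This only covers the "clustered somehow" requirement of an aggregate. It says nothing about matching Spark's own shuffle hash function, which is the part the quoted comment says a join needs.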



solution1 0 2019-03-06 08:47:05
