How to ensure data locality in Spark data source v2?

I am implementing a Spark data source (v2) and haven't found a way to ensure data locality.

In data source v1, the getPreferredLocations method can be implemented. What is the equivalent in data source v2?

In Spark data source v2, you should switch to the SupportsReportPartitioning interface.

Some of its limitations are discussed in the issue SPARK-15689 (Data source API v2):

So SupportsReportPartitioning is not powerful enough to support custom hash functions yet. There are two major operators that may introduce a shuffle: join and aggregate. Aggregate only needs the data to be clustered and doesn't care how, so data source v2 can support it if your implementation handles ClusteredDistribution. Join needs the data of both children to be clustered by the Spark shuffle hash function, which data source v2 does not currently support.
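To make the difference between the two requirements concrete, here is a minimal sketch in plain Java, with no Spark dependency. Everything in it is hypothetical and only for illustration: the class ClusteringDemo, its helper methods, and the two hash functions, which stand in for a source's own hash and for the engine's shuffle hash. It is not Spark's implementation. It shows that two inputs can each be clustered by key (enough for an aggregate) while still not being co-partitioned (which a shuffle-free join would need).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Hypothetical illustration only; none of these names come from Spark.
final class ClusteringDemo {

    // Assigns every key to a partition using the given hash function.
    public static List<List<Integer>> partition(int[] keys, int numPartitions,
                                                IntUnaryOperator hash) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int k : keys) {
            parts.get(Math.floorMod(hash.applyAsInt(k), numPartitions)).add(k);
        }
        return parts;
    }

    // Clustered: all occurrences of a key live in exactly one partition.
    public static boolean isClustered(List<List<Integer>> parts) {
        Map<Integer, Integer> owner = new HashMap<>();
        for (int p = 0; p < parts.size(); p++) {
            for (int k : parts.get(p)) {
                Integer prev = owner.putIfAbsent(k, p);
                if (prev != null && prev != p) {
                    return false;
                }
            }
        }
        return true;
    }

    // Co-partitioned: a key present on both sides sits at the same partition
    // index on both sides. Assumes the left side is already clustered.
    public static boolean isCoPartitioned(List<List<Integer>> left,
                                          List<List<Integer>> right) {
        if (left.size() != right.size()) {
            return false;
        }
        Map<Integer, Integer> owner = new HashMap<>();
        for (int p = 0; p < left.size(); p++) {
            for (int k : left.get(p)) {
                owner.put(k, p);
            }
        }
        for (int p = 0; p < right.size(); p++) {
            for (int k : right.get(p)) {
                Integer leftIndex = owner.get(k);
                if (leftIndex != null && leftIndex != p) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] keys = {1, 2, 3, 4, 1, 2, 3, 4};
        IntUnaryOperator sourceHash = k -> k;      // stand-in for the source's own hash
        IntUnaryOperator engineHash = k -> k + 1;  // stand-in for the engine's shuffle hash

        List<List<Integer>> bySource = partition(keys, 2, sourceHash);
        List<List<Integer>> byEngine = partition(keys, 2, engineHash);

        // Each side is clustered on its own, which is all an aggregate needs.
        System.out.println("source side clustered: " + isClustered(bySource));
        System.out.println("engine side clustered: " + isClustered(byEngine));
        // The two sides disagree on where each key lives, so a join would still shuffle.
        System.out.println("co-partitioned: " + isCoPartitioned(bySource, byEngine));
    }
}
```

In this toy setup both sides pass the clustering check, but key 1 lands in partition 1 under one hash and partition 0 under the other, so the co-partitioning check fails. That mismatch is the situation the quoted comment describes for joins.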
