
How to open a file inside an RDD when the RDD holds the path to that file?

I am working with Sentinel image data on Apache Spark using Scala. At one step I filter the metadata down to the entries that contain a specific location, and for each of those I want to open another file located in a subfolder.

The filtered RDD is mapped to pairs whose key is the path to the global metadata file and whose value is the path to the file I would like to open:

val global_and_cloud = global_filter.map { case (name, positions_list, granule) =>
  (name, name.substring(0, name.length - 14) + granule.substring(13, 56) + "QI_DATA/MSK_CLOUDS_B00.gml")
}

The best I can do is

var global_and_cloud2=global_and_cloud.map{case(name, cloud_path)=>
(sc.wholeTextFiles(cloud_path).first._1, sc.wholeTextFiles(cloud_path).first._2)}

but it throws a java.lang.NullPointerException as soon as I run an action on it,

and when I do

sc.wholeTextFiles(global_and_cloud.first._2).first._2

I get the content of the file, so the file does exist.

Is there any way to read a file inside an RDD?

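You can't use any of Spark's driver-side abstractions (SparkSession, RDD, DataFrame, etc.) within any function used to operate on an RDD's data (i.e. functions passed to RDD.map, RDD.filter, etc.). Those objects live only on the driver, so a reference to sc inside such a function is not usable on the executors, which is most likely where the NullPointerException comes from. For a full explanation, see: Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset.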

You'll have to collect() the global_and_cloud RDD, which creates a local array of file names in the driver application's memory. You can then map that array into an Array of pairs, each holding a file name and the RDD with that file's data, something like:

import org.apache.spark.rdd.RDD

// bring the (name, cloud_path) pairs back to the driver
val files: Array[(String, String)] = global_and_cloud.collect()

// since "files" is a "local" array and not an RDD - we can use 
// "sc" when mapping its values:
val rdds: Array[(String, RDD[String])] = files.map {
  case (name, cloud_path) => (name, sc.textFile(cloud_path))
}
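If one RDD per file turns out to be awkward to work with, the collected paths can also be handed to a single wholeTextFiles call, since that method accepts a comma-separated list of paths. The following is only a sketch: loadCloudMasks is a hypothetical helper name, and it assumes that files is non-empty and that none of the paths contain a comma.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of the original answer): loads every
// cloud-mask file into ONE RDD of (filePath, fileContent) pairs.
// Assumes `files` is non-empty and no path contains a comma.
def loadCloudMasks(sc: SparkContext,
                   files: Array[(String, String)]): RDD[(String, String)] = {
  // keep only the cloud_path part of each (name, cloud_path) pair
  val allPaths = files.map { case (_, cloudPath) => cloudPath }.mkString(",")
  // runs on the driver, so using sc here is fine
  sc.wholeTextFiles(allPaths)
}
```

The keys of the resulting RDD are the file paths as Spark resolves them (usually fully qualified URIs), so they may not be textually identical to the cloud_path strings that were passed in.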

Note that if global_and_cloud is too large to be collected into local memory, this might cause slowness or an OutOfMemoryError. But that would mean you're trying to "open" millions of files, which would fail anyway (it would require too much driver memory to hold that many RDDs).
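For cases where collecting really is too expensive, one alternative that the answer above does not cover is to read the files on the executors with the Hadoop FileSystem API, which is plain JVM I/O and does not need the SparkContext. This is a rough sketch under several assumptions: readOnExecutors is a hypothetical name, every path must be reachable from every executor, a default Hadoop Configuration must be enough to access the files, and each file must be UTF-8 text small enough to hold in memory as one String.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD
import scala.io.Source

// Hypothetical helper (an assumption, not the answer's method): replaces each
// cloud_path with the content of the file it points to, reading on the executors.
def readOnExecutors(paths: RDD[(String, String)]): RDD[(String, String)] =
  paths.mapPartitions { iter =>
    // Created on the executor, once per partition; it only sees the Hadoop
    // settings available there, which may differ from the driver's.
    val conf = new Configuration()
    iter.map { case (name, cloudPath) =>
      val path = new Path(cloudPath)
      val fs = path.getFileSystem(conf) // resolves hdfs://, file://, ... from the path
      val in = fs.open(path)
      try {
        (name, Source.fromInputStream(in, "UTF-8").mkString)
      } finally in.close()
    }
  }
```

With the RDD from the question this would be called as readOnExecutors(global_and_cloud). No sc call appears inside the function passed to mapPartitions, so the restriction described above does not apply.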
