I am working with Sentinel image data on Apache Spark using Scala. At one step I filter the metadata down to entries that contain a specific location, and for each of those entries I want to open another file located in a subfolder.
The filtered RDD has the path to the global metadata file as its key and the path to the file I would like to open as its value.
var global_and_cloud=global_filter.map{case(name, positions_list, granule)=>
(name, (name.substring(0, name.length-14)+granule.substring(13,56)+"QI_DATA/MSK_CLOUDS_B00.gml"))}
The best I can do is
var global_and_cloud2=global_and_cloud.map{case(name, cloud_path)=>
(sc.wholeTextFiles(cloud_path).first._1, sc.wholeTextFiles(cloud_path).first._2)}
but it throws a java.lang.NullPointerException when I run an action on it,
and when I do
sc.wholeTextFiles(global_and_cloud.first._2).first._2
I get the content of the file, so the file exists.
Is there any way to read a file inside an RDD?
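You can't use any of Spark's driver-side abstractions (SparkSession, RDD, DataFrame, etc.) within any function used to operate on an RDD's data (i.e. functions passed to RDD.map, RDD.filter, etc.). In your code, sc.wholeTextFiles is called inside the function passed to map, which runs on the executors, where sc is not usable; that is the likely source of the NullPointerException. See the full explanation here: Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset.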
You'll have to collect() the global_and_cloud RDD, which creates a local array (in the driver application's memory) of file names. You can then map that array into an Array pairing each file name with the RDD holding that file's data, something like:
import org.apache.spark.rdd.RDD

val files: Array[(String, String)] = global_and_cloud.collect()
// since "files" is a "local" array and not an RDD - we can use
// "sc" when mapping its values:
val rdds: Array[(String, RDD[String])] = files.map {
  case (name, cloud_path) => (name, sc.textFile(cloud_path))
}
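If what you want is a single RDD of (path, content) pairs rather than one RDD per file, a possible variation is to collect only the paths and hand them to sc.wholeTextFiles in one call, since that method accepts a comma-separated list of input paths. This is a minimal sketch and not the only way to do it: it assumes that no path contains a comma, and the names CloudMasks and readCloudMasks are made up for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CloudMasks {
  // Hypothetical helper: returns (file URI, file content) pairs,
  // one per distinct cloud mask path found in the input RDD.
  def readCloudMasks(sc: SparkContext,
                     globalAndCloud: RDD[(String, String)]): RDD[(String, String)] = {
    // Bring only the path strings back to the driver.
    val cloudPaths: Array[String] = globalAndCloud.values.distinct().collect()

    if (cloudPaths.isEmpty) {
      // An empty path string would be rejected, so return an empty RDD instead.
      sc.emptyRDD[(String, String)]
    } else {
      // This call happens on the driver, so using "sc" is fine here.
      // Assumption: no path contains a comma, because commas separate the inputs.
      sc.wholeTextFiles(cloudPaths.mkString(","))
    }
  }
}
```

Called as CloudMasks.readCloudMasks(sc, global_and_cloud), this gives the same shape your global_and_cloud2 was aiming for. Keep in mind that the keys returned by wholeTextFiles are fully qualified URIs, so they may not be string-equal to the paths you passed in.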
NOTE that if global_and_cloud is too large to be collected into local memory, this might cause slowness or an OutOfMemoryError. But that would mean you're trying to "open" millions of files, which would fail anyway (it would require too much driver memory to hold that many RDDs).
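If collecting really is a concern, one alternative that goes beyond the approach above is to skip sc entirely and read each file on the executors with the Hadoop FileSystem API, which does not depend on any driver-side Spark object. The sketch below is an assumption-laden illustration, not a drop-in solution: it assumes every path is reachable from the executors, that the files are UTF-8 text small enough to fit in executor memory, and that a default Hadoop Configuration built on the executor is sufficient. The names CloudMaskReader and readOnExecutors are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD
import scala.io.Source

object CloudMaskReader {
  // Hypothetical helper: maps (name, cloudPath) to (name, file content),
  // reading each file on the executor that processes the record.
  def readOnExecutors(globalAndCloud: RDD[(String, String)]): RDD[(String, String)] =
    globalAndCloud.mapPartitions { pairs =>
      // Created once per partition, on the executor. It only sees the Hadoop
      // settings available on the executor's classpath (assumption: that is enough).
      val hadoopConf = new Configuration()
      pairs.map { case (name, cloudPath) =>
        val path = new Path(cloudPath)
        val in = path.getFileSystem(hadoopConf).open(path)
        try {
          // Read the whole file as UTF-8 text (assumed encoding).
          (name, Source.fromInputStream(in, "UTF-8").mkString)
        } finally {
          in.close()
        }
      }
    }
}
```

With this, CloudMaskReader.readOnExecutors(global_and_cloud) stays an ordinary RDD transformation, so nothing is pulled into the driver until you run an action on the result.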