
Spark: Text file to RDD[Byte]

I need to load a text file into an RDD so that I can run tasks on the data it contains. The driver program is written in Scala, and the code to be executed in each task is available as a native dynamic library accessed via JNI.

For now, I'm creating the RDD this way:

val rddFile : RDD[String] = sc.textFile(path);

I already have the native C code for the tasks, but it uses byte-level operations on real files, i.e. fgetc(). I'm trying to emulate the same kind of operation (to minimize refactoring) while avoiding writing the data fragments to be processed by the native library to disk, which would hurt performance.

Here is the definition of the native function and how I'm calling it:

natFunction(data: Array[String])
rddFile.glom().foreach(elem => natFunction(elem))

However, the RDD returned by textFile() holds String objects, which need to be converted on the native side of JNI into valid C strings. I believe the performance impact of that conversion, applied to every line of the file, could be significant, but still smaller than operating on files.

I also reckon that a more compatible type would be RDD[Byte], so that I can send arrays of bytes to the native side, which can be converted to C strings in a more immediate way.

Are these assumptions true? If so, what would be an efficient way to load a text file as an RDD[Byte]?

Any other suggestion to solve this problem is welcome.

You can get RDD[Byte] from RDD[String] by doing rdd.flatMap(s => s.getBytes). However, beware: getBytes() called without an argument uses the JVM's default charset, so a character may encode to more than one byte. Pass an explicit charset (e.g. UTF-8) to get predictable results.
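The charset caveat can be checked directly in plain Scala, without Spark. A minimal sketch:

```scala
import java.nio.charset.StandardCharsets

object GetBytesDemo {
  def main(args: Array[String]): Unit = {
    // ASCII characters are 1 byte each in UTF-8
    println("abc".getBytes(StandardCharsets.UTF_8).length) // 3
    // Non-ASCII characters expand: 'é' is 2 bytes in UTF-8
    println("é".getBytes(StandardCharsets.UTF_8).length) // 2
    // Java's UTF-16 charset emits a 2-byte BOM plus 2 bytes per char
    println("abc".getBytes(StandardCharsets.UTF_16).length) // 8
  }
}
```

So with an explicit UTF-8 charset the byte layout matches what a C program reading the file with fgetc() would see, as long as the file itself is UTF-8.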

Also, once you have RDD[Byte] you will need to call, for example, mapPartitions to pass your data as Array[Byte] to your C code. In that case you will pass quite large arrays to the C code, but the C function will be called only once per partition. Another way is rdd.map(s => s.getBytes), which gives you RDD[Array[Byte]] and thus multiple native calls per partition.
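The per-partition approach can be sketched as follows. `natFunction` is a stub standing in for the question's JNI entry point, and the Spark calls (which assume the question's `rddFile`) are shown in comments since they need a running SparkContext; only the pure conversion helper runs here:

```scala
import java.nio.charset.StandardCharsets

object NativeBatching {
  // Stub for the hypothetical JNI entry point from the question.
  def natFunction(data: Array[Byte]): Unit = ()

  // Flatten all lines of a partition into one byte array,
  // re-inserting the newline separators that textFile() strips.
  def partitionToBytes(lines: Iterator[String]): Array[Byte] =
    lines.map(_ + "\n").mkString.getBytes(StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    // With Spark (not run here): one native call per partition
    //   rddFile.foreachPartition(it => natFunction(partitionToBytes(it)))
    // versus one native call per line:
    //   rddFile.foreach(line => natFunction(line.getBytes(StandardCharsets.UTF_8)))
    println(new String(partitionToBytes(Iterator("a", "b")), StandardCharsets.UTF_8))
  }
}
```

Batching per partition keeps the number of expensive JNI crossings proportional to the number of partitions rather than the number of lines.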

You can also try the pipe() API to launch your C code as an external process: Spark pipes the RDD elements to the program's stdin, one per line, and collects its stdout lines as a new RDD for further processing.
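The contract of pipe() (lines in on stdin, lines out on stdout) can be imitated locally with scala.sys.process; the Spark call itself is shown in a comment, assuming the question's `rddFile` and a hypothetical native binary:

```scala
import scala.sys.process._

object PipeDemo {
  def main(args: Array[String]): Unit = {
    // With Spark (not run here), pipe() forwards each RDD element to the
    // external program's stdin, one per line, and returns its stdout
    // lines as a new RDD[String]:
    //   val processed = rddFile.pipe("./native_binary")

    // Locally, the same stdin/stdout contract, using `tr` as a stand-in
    // for the native program:
    val input = Seq("hello", "world").mkString("\n")
    val out = (Seq("tr", "a-z", "A-Z") #<
      new java.io.ByteArrayInputStream(input.getBytes("UTF-8"))).!!
    println(out.trim) // HELLO and WORLD on separate lines
  }
}
```

This keeps the native code as a standalone executable reading stdin with fgetc(), so it may need no refactoring at all, at the cost of serializing the data through pipes.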
