
Spark: Text file to RDD[Byte]

I need to load a text file into an RDD so that I can run tasks on the data it contains. The driver program is written in Scala, and the code that is going to be executed in each task is available as a native dynamic library accessed via JNI.

For now, I'm creating the RDD this way:

val rddFile: RDD[String] = sc.textFile(path)

I have the C native code for the tasks, although it uses byte-level operations on real files, i.e., fgetc(). I'm trying to emulate the same kind of operation (to minimize code refactoring) while avoiding writing the fragments of data to be processed by said native library to disk, which would hurt performance.

Here is the definition of the native function and how I'm calling it:

natFunction(data: Array[String])
rddFile.glom().foreach(elem => natFunction(elem))
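For reference, declaring and loading such a native function via JNI in Scala typically looks like the sketch below; the library name "natlib" is an assumption, not something stated in the question.

object NativeLib {
  // Loads libnatlib.so (Linux) / natlib.dll (Windows) from java.library.path
  System.loadLibrary("natlib")
  // Implemented on the C side via JNI
  @native def natFunction(data: Array[String]): Unit
}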

However, the RDD resulting from calling textFile() holds String objects, which need to be converted on the native side of JNI into valid C strings. I believe the performance impact of applying that conversion to every line of the file could be huge, but still less than operating on files.

I also reckon that a more compatible type would be RDD[Byte], so that I can send arrays of bytes to the native side, which can be converted to C strings in a more immediate way.

Are these assumptions true? If so, what would be an efficient way to load a text file as an RDD[Byte]?

Any other suggestion to solve this problem is welcome.

You can get an RDD[Byte] from an RDD[String] by doing rdd.flatMap(s => s.getBytes), but beware: getBytes with no arguments uses the platform's default charset, so a single character may well encode to more than one byte.
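A minimal sketch of that conversion, pinning the charset explicitly so the byte layout seen by the C side does not depend on the JVM's default encoding (the choice of UTF-8 here is an assumption):

import org.apache.spark.rdd.RDD

val rddBytes: RDD[Byte] = rddFile.flatMap(s => s.getBytes("UTF-8"))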

Also, once you have an RDD[Byte] you will need to call, for example, mapPartitions to hand your data to your C code as an Array[Byte]. In that case you will have quite large arrays passed to your C code, but for each partition the C application will be called only once. Another way would be to use rdd.map(s => s.getBytes), in which case you will have an RDD[Array[Byte]] and thus multiple C application calls per partition.
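A sketch of both options, using foreachPartition for the first since the native call returns nothing; natFunctionBytes stands in for a hypothetical JNI binding taking an Array[Byte], alongside the natFunction from the question:

// One native call per partition: gather the partition into a single array
rddBytes.foreachPartition { bytes =>
  NativeLib.natFunctionBytes(bytes.toArray)
}

// One native call per line: each line becomes its own Array[Byte]
rddFile.map(s => s.getBytes("UTF-8")).foreach(NativeLib.natFunctionBytes)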

I think you can try the pipe() API for launching your C code: just pipeline RDD elements to your C application and get its output for further processing.
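A sketch of the pipe() approach; "/path/to/native_app" is a placeholder for the compiled C program, which would read lines from stdin and write results to stdout:

// Each element of rddFile is fed to the program's stdin as one line;
// each line the program writes to stdout becomes an element of `processed`.
val processed: RDD[String] = rddFile.pipe("/path/to/native_app")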
