Streaming from Spark RDD to Scala Process

I have a Spark RDD[String] that I would like to stream to the standard input of an external command on the local machine. The setup would be something like this:

val data: RDD[String] = <Valid data>
// Not the actual process, but it works the same way: it consumes a whole bunch of lines and produces very little output itself
val process = Seq("wc", "-l")

// Here's what I've tried so far

// Doesn't work
val exitCode = (process #< data.toLocalIterator.toStream) !
// Works, but seems to load the whole data into local memory, which is definitely not what I want as data could be very big
val exitCode = (process #< new ByteArrayInputStream(data.toLocalIterator.mkString("\n").getBytes("UTF-8"))) !

val processIO = new ProcessIO(
  in => data.toLocalIterator.toStream,
  out => scala.io.Source.fromInputStream(out).getLines.foreach(println),
  err => scala.io.Source.fromInputStream(err).getLines.foreach(println))

val exitCode = process.run(processIO) // This also doesn't work

Can anyone point me to a working solution that doesn't load all the data on the local machine and just streams it from an RDD[String] straight to the process, just like I'd do with

cat data.txt | wc -l

on the command line.

Thanks

I think I've figured this out. It seems that I forgot to actually write anything to the process's input stream. Here is code that seems to work in my small tests. I haven't tested it on large data yet, but it looks like it should work.

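import java.nio.charset.StandardCharsets
import scala.sys.process._
import scala.util.Properties

// Write each line of the RDD to the process's stdin, then close it to signal end of input.
// StandardCharsets stands in for the original Charsets.UTF_8, whose import was not shown.
val processIO = BasicIO.standard(in => {
  data.toLocalIterator.foreach(x => in.write((x + Properties.lineSeparator).getBytes(StandardCharsets.UTF_8)))
  in.close()
})

val exitCode = process.run(processIO).exitValue()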
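
For anyone who wants to try the idea without a Spark cluster, here is a minimal sketch of the same approach using a plain Iterator[String] in place of the RDD. The object StreamToProcess and the helper streamLines are hypothetical names made up for this illustration, not part of Spark or the Scala library. It assumes a Unix-like machine with wc on the PATH, and it captures stdout instead of printing it.

```scala
import java.io.OutputStream
import java.nio.charset.StandardCharsets
import scala.sys.process._

object StreamToProcess {
  // Hypothetical helper: feeds `lines` to the stdin of `cmd` one line at a time
  // and returns the exit code together with whatever the command wrote to stdout.
  def streamLines(lines: Iterator[String], cmd: Seq[String]): (Int, String) = {
    val stdout = new StringBuilder
    val io = new ProcessIO(
      (in: OutputStream) => {
        // Only one line is materialized at a time, so the full input is never held in memory here.
        try lines.foreach(l => in.write((l + "\n").getBytes(StandardCharsets.UTF_8)))
        finally in.close() // closing stdin signals end of input to the command
      },
      out => {
        stdout ++= scala.io.Source.fromInputStream(out, "UTF-8").mkString
        out.close()
      },
      err => {
        scala.io.Source.fromInputStream(err, "UTF-8").getLines().foreach(System.err.println)
        err.close()
      }
    )
    // exitValue() blocks until the command has exited and its output has been read.
    val exitCode = cmd.run(io).exitValue()
    (exitCode, stdout.toString.trim)
  }
}
```

With Spark, the call would presumably look like StreamToProcess.streamLines(data.toLocalIterator, Seq("wc", "-l")). As far as I know, toLocalIterator brings one partition at a time to the driver, so memory use should be bounded by the largest partition rather than by the whole RDD.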

This is not an answer, but you should be aware that it won't behave like cat data.txt | wc -l, since the RDD can (and usually will) be split across multiple processes (tasks running in executors). Your receiving program needs to be able to accept multiple streams, and you should know that the data will not be ordered.
