
Spark Streaming fileStream


I'm programming with Spark Streaming but am having some trouble with Scala. I'm trying to use the function StreamingContext.fileStream.

The definition of this function is:

def fileStream[K, V, F <: InputFormat[K, V]](directory: String)(implicit arg0: ClassManifest[K], arg1: ClassManifest[V], arg2: ClassManifest[F]): DStream[(K, V)]

Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them using the given key-value types and input format. File names starting with . are ignored.

K - Key type for reading the HDFS file
V - Value type for reading the HDFS file
F - Input format for reading the HDFS file
directory - HDFS directory to monitor for new files

I don't know how to pass the Key and Value types. My code in Spark Streaming:

val ssc = new StreamingContext(args(0), "StreamingReceiver", Seconds(1),
  System.getenv("SPARK_HOME"), Seq("/home/mesos/StreamingReceiver.jar"))

// Create an input stream that monitors the given HDFS directory for new files
val lines = ssc.fileStream("/home/sequenceFile")

Java code to write the Hadoop SequenceFile:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MyDriver {

    private static final String[] DATA = { "One, two, buckle my shoe",
            "Three, four, shut the door", "Five, six, pick up sticks",
            "Seven, eight, lay them straight", "Nine, ten, a big fat hen" };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
                    value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key,
                        value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

If you want to use fileStream, you're going to have to supply all 3 type parameters when calling it. You need to know what your Key, Value and InputFormat types are before calling it. If your types were LongWritable, Text and TextInputFormat, you would call fileStream like so:

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/sequenceFile")
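
For completeness, a minimal sketch of the imports that call needs (a sketch, assuming the new-API TextInputFormat from the org.apache.hadoop.mapreduce package layout, which is what the F type parameter expects):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Each record is a (byte offset, line of text) pair; keep just the text.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/sequenceFile").map(_._2.toString)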

If those 3 types do happen to be your types, then you might want to use textFileStream instead, as it does not require any type parameters and delegates to fileStream using those 3 types. Using it would look like this:

val lines = ssc.textFileStream("/home/sequenceFile")
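
Whichever variant you use, the resulting DStream is consumed like any other; a minimal, hypothetical word-count sketch on top of lines:

// Split each line into words and print a per-batch count.
val words = lines.flatMap(_.split(" "))
words.count().print()

ssc.start() // nothing runs until the streaming context is started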
A different answer passes a custom path filter (plus the newFilesOnly flag) to fileStream, so that only files whose name ends in an _<timestamp> component earlier than the current time are picked up:

val filterF = (path: Path) => {
  // Keep a file only if the trailing _<timestamp> in its name is in the past.
  path.toString.split("/").last.split("_").last.toLong < System.currentTimeMillis
}

val streamed_rdd = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("/user/hdpprod/temp/spark_streaming_input", filterF, false)
  .map(_._2.toString)
  .map(u => u.split('\t'))
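
One caveat worth noting: the directory in the original question is populated by the Java SequenceFile writer above, so its files are SequenceFiles of IntWritable keys and Text values, not plain text. Reading those takes different type parameters; a sketch, assuming the new-API SequenceFileInputFormat:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

// Hadoop reuses the Writable objects, so copy values out immediately.
val pairs = ssc.fileStream[IntWritable, Text, SequenceFileInputFormat[IntWritable, Text]]("/home/sequenceFile")
val records = pairs.map { case (k, v) => (k.get, v.toString) }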
