Apache Spark: reading files with a regular expression

Question

I am sending a stream to HDFS and trying to read the resulting text files with Spark.

JavaStreamingContext jssc = new JavaStreamingContext(jsc, new Duration(1000));
JavaPairInputDStream<LongWritable, Text> textStream = jssc.fileStream(
    "hdfs://myip:9000/travel/FlumeData.[0-9]*",
    LongWritable.class, Text.class, TextInputFormat.class);

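While the stream is being written to HDFS, temporary files such as FlumeData.1234.tmp are created. Once the complete data has been received, the file is renamed to its final name, e.g. FlumeData.1234.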

I want Spark to ignore these .tmp files when reading. I tried the following patterns:

hdfs://myip:9000/travel/FlumeData.[0-9]*
hdfs://myip:9000/travel/FlumeData.//d*

But neither of them worked. I am looking for something like jssc.fileStream("hdfs://myip:9000/travel/FlumeData.[0-9]*", LongWritable.class, Text.class, TextInputFormat.class);

fileStream should not read files that have the .tmp extension.

I also tried the following Hadoop code to retrieve the list of files:

private String pathValue(String PathVariable) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(PathVariable);
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    System.out.println("PathVariable" + fs.getWorkingDirectory());

    return fs.getName();
}

But the FileSystem object fs has no filename() method. Because new files are created at run time, I need to read them as they are created.
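To make the goal concrete, here is a minimal sketch of "list the files in a directory, keeping only the completed ones". It deliberately swaps in java.nio on a local directory as a stand-in for HDFS so that it runs without a cluster; on HDFS the analogous call would be Hadoop's FileSystem.listStatus(Path, PathFilter). The class name CompletedFileLister and the assumption that completed files are named FlumeData. followed by at least one digit are illustrative, not taken from the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: local-filesystem stand-in for an HDFS directory listing.
class CompletedFileLister {

    // Returns the sorted names of entries in dir whose whole name is
    // "FlumeData." followed by one or more digits (so *.tmp is excluded).
    static List<String> listCompleted(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .map(p -> p.getFileName().toString())
                .filter(name -> name.matches("FlumeData\\.[0-9]+"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with one finished and one in-progress file.
        Path dir = Files.createTempDirectory("travel");
        Files.createFile(dir.resolve("FlumeData.1234"));
        Files.createFile(dir.resolve("FlumeData.5678.tmp"));
        System.out.println(listCompleted(dir));
    }
}
```

A one-off listing like this does not pick up files created later, which is why a streaming source with a filter is the better fit for the question.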

Answer 1

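You need to use the () selector (a capture group) to pick out the part of the match you want to keep. If no group is specified, the whole match is returned.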

In your case, if I am not mistaken, you want to select, for example:

FlumeData.1234 from FlumeData.1234.tmp

The simple regular expression you need for that is:

(.*)\.tmp

This selects everything before the .tmp extension.
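The capture-group idea can be sketched with java.util.regex as follows. The class name TmpSuffixStripper and the method stripTmp are hypothetical, introduced only for illustration, and the dot before tmp is escaped so that it matches a literal period.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class TmpSuffixStripper {

    // Group 1 captures everything before a literal ".tmp" at the end of the name.
    private static final Pattern TMP_NAME = Pattern.compile("(.*)\\.tmp");

    // Returns the name without its ".tmp" suffix, or the name unchanged
    // when it does not end in ".tmp".
    static String stripTmp(String fileName) {
        Matcher m = TMP_NAME.matcher(fileName);
        return m.matches() ? m.group(1) : fileName;
    }

    public static void main(String[] args) {
        System.out.println(stripTmp("FlumeData.1234.tmp")); // FlumeData.1234
        System.out.println(stripTmp("FlumeData.1234"));     // FlumeData.1234
    }
}
```

Note that this only extracts the base name. On its own it does not stop fileStream from picking up the .tmp files.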

Answer 2

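The fileStream method, which returns a JavaPairInputDStream, has an overload that takes a filter function. We can write a filter function to choose which files in the directory are read.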

fileStream(directory, kClass, vClass, fClass, filter, newFilesOnly)

import java.util.regex.Pattern;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.Function;

JavaPairInputDStream<LongWritable, Text> lines = jssc.fileStream(
    "hdfs://myip:9000/travel/", LongWritable.class, Text.class, TextInputFormat.class,
    new Function<Path, Boolean>() {
        @Override
        public Boolean call(Path path) throws Exception {
            // matches() must cover the whole file name, so FlumeData.1234 is
            // accepted while FlumeData.1234.tmp is rejected.
            boolean accepted = Pattern.matches("FlumeData\\.[0-9]*", path.getName());
            System.out.println("Is path: " + path.getName() + " matching? " + accepted);
            return accepted;
        }
    }, true); // newFilesOnly = true

Please try running the code above; hopefully it solves the problem.
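The decision the filter makes can be checked on its own, without a Spark cluster. The sketch below isolates just the name test; the class name FlumeFileNameFilter and the method isCompleted are hypothetical. It also assumes completed files always carry at least one digit after the dot, so it uses [0-9]+ rather than [0-9]*.

```java
import java.util.regex.Pattern;

// Hypothetical helper isolating the file-name test used by the filter.
class FlumeFileNameFilter {

    // Compiled once and reused. The dot is escaped so it matches only a literal period.
    private static final Pattern COMPLETED = Pattern.compile("FlumeData\\.[0-9]+");

    // True only when the entire name is "FlumeData." followed by digits.
    static boolean isCompleted(String fileName) {
        return COMPLETED.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"FlumeData.1234", "FlumeData.1234.tmp", "FlumeData."};
        for (String name : names) {
            System.out.println(name + " -> " + isCompleted(name));
        }
    }
}
```

Because matches() requires the pattern to cover the whole input, the trailing .tmp is enough to reject an in-progress file. Using find() instead would accept it.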

Answer 1
0 2016-03-03 08:07:09

Answer 2
0 Accepted 2016-03-03 10:01:10
