Reading JSON files from Hadoop with Spark

Question

I have several JSON files (compressed in .gz format) in HDFS directories arranged in a tree like this:

/master/dir1/file1.gz
       /dir2/file2.gz
       /dir3/file3.gz
       ...

I need to read those files from the path /master/ and combine them into a single RDD with Spark in Java. How can I do it?

Answer 1

[Edit] If

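// hdfs:/// (empty authority) uses the default namenode; "hdfs://master/..." would treat "master" as a host name
JavaRDD<String> textFile = sc.textFile("hdfs:///master/dir*/file*");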

doesn't work, another way is to list the directories and union the RDDs read from each one:

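// globStatus expands the wildcard (listStatus does not); sc.textFile on a directory reads every file in it.
// Needs java.util.Arrays and org.apache.hadoop.fs.FileStatus / Path; globStatus throws IOException.
JavaRDD<String> textFile = Arrays.stream(fileSystem.globStatus(new Path("hdfs:///master/dir*")))
  .filter(FileStatus::isDirectory)
  .map(d -> sc.textFile(d.getPath().toString()))
  .reduce(JavaRDD::union)
  .orElseGet(() -> sc.emptyRDD());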
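
The list-and-union idea can be tried without a cluster. The sketch below is not Spark code. It uses only the JDK: a temporary local directory stands in for /master/, and java.util.zip stands in for the decompression that sc.textFile performs on .gz files. The class name GzUnionSketch, its helper methods, and the sample records are all hypothetical and serve only to illustrate the pattern.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzUnionSketch {

    // Write lines into a gzip-compressed file (used only to build sample data).
    static void writeGz(Path file, List<String> lines) {
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                w.write(line);
                w.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read every line of one gzip-compressed file.
    static List<String> readGz(Path file) {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            return r.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Return the entries of dir whose names match the glob, sorted for a stable order.
    static List<Path> listSorted(Path dir, String glob) {
        List<Path> out = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, glob)) {
            entries.forEach(out::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(out);
        return out;
    }

    // List <root>/dir*/file* and concatenate the lines of every match (the "union" step).
    static List<String> readAll(Path root) {
        List<String> all = new ArrayList<>();
        for (Path dir : listSorted(root, "dir*")) {
            if (!Files.isDirectory(dir)) {
                continue; // skip plain files that happen to match dir*
            }
            for (Path file : listSorted(dir, "file*")) {
                all.addAll(readGz(file));
            }
        }
        return all;
    }

    // Build a temporary tree: <tmp>/dir1/file1.gz, dir2/file2.gz, dir3/file3.gz.
    static Path sampleTree() {
        try {
            Path root = Files.createTempDirectory("master");
            for (int i = 1; i <= 3; i++) {
                Path dir = Files.createDirectories(root.resolve("dir" + i));
                writeGz(dir.resolve("file" + i + ".gz"),
                        Collections.singletonList("{\"id\": " + i + "}"));
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(sampleTree()));
    }
}
```

In Spark itself, gzip files are not splittable, so each .gz file is read as a single partition regardless of its size.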

Answer 1 posted 2016-04-29 12:18:54
