
How to read multiple gzipped files from S3 into a single RDD with http request?

I have to download many gzipped files stored on S3 like this:

crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz

To download them, you must add the prefix https://commoncrawl.s3.amazonaws.com/

I have to download and decompress the files, then assemble the content as a single RDD.

Something similar to this:

JavaRDD<String> text = 
    sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz");

I want to do the equivalent of this code with Spark:

    List<String> sitemaps = new ArrayList<>();

    for (String key : keys) {
        S3Object object = s3.getObject(new GetObjectRequest(bucketName, key));

        // decompress the object on the fly and read it line by line
        GZIPInputStream gzipStream = new GZIPInputStream(object.getObjectContent());
        InputStreamReader decoder = new InputStreamReader(gzipStream);
        BufferedReader buffered = new BufferedReader(decoder);

        String line = buffered.readLine();

        while (line != null) {
            // keep only the "Sitemap:" lines
            if (line.matches("Sitemap:.*")) {
                sitemaps.add(line);
            }
            line = buffered.readLine();
        }

        buffered.close();
    }

To read something from S3, you can do this:

sc.textFile("s3n://path/to/dir")

If dir contains your gzip files, they will be gunzipped and combined into one RDD. If your files are not directly at the root of the directory, like this:

/root
  /a
    f1.gz
    f2.gz
  /b
    f3.gz

or even this:

/root
  f3.gz
  /a
    f1.gz
    f2.gz

then you should use a wildcard like this: sc.textFile("s3n://path/to/dir/*"), and Spark will recursively find the files in dir and its subdirectories.
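For instance, assuming sc is an existing JavaSparkContext and reusing the Sitemap filter from the question (the path is just the placeholder from above), a minimal sketch would be:

JavaRDD<String> lines = sc.textFile("s3n://path/to/dir/*");   // one RDD over every matched .gz file, decompressed on read
JavaRDD<String> sitemaps = lines.filter(line -> line.startsWith("Sitemap:"));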

Beware of this though: the wildcard will work, but you may get latency issues on S3 in production, and you may want to use the AmazonS3Client to retrieve the paths yourself.
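A minimal sketch of that alternative, assuming the AWS Java SDK (v1) and Spark's Java API are on the classpath and S3 credentials are configured; the bucket and prefix come from the question, while the class name, app name, region, and .gz filter are illustrative:

import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CommonCrawlReader {                        // illustrative class name
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("cc-robotstxt"));
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();

        // list every .gz key under the prefix, paginating through the result
        List<String> paths = new ArrayList<>();
        ObjectListing listing = s3.listObjects("commoncrawl",
                "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/");
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                if (summary.getKey().endsWith(".gz")) {
                    paths.add("s3n://commoncrawl/" + summary.getKey());
                }
            }
            if (!listing.isTruncated()) break;
            listing = s3.listNextBatchOfObjects(listing);
        }

        // textFile accepts a comma-separated list of paths and still returns a single RDD;
        // the .gz files are decompressed transparently on read
        JavaRDD<String> text = sc.textFile(String.join(",", paths));
        System.out.println(text.count());

        sc.stop();
    }
}

Listing the keys up front avoids the glob expansion against S3 and also lets you control exactly which objects end up in the RDD.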
