Persisting RDD on Amazon S3
I have a large text file containing JSON objects on Amazon S3. I am planning to process this data using Spark on Amazon EMR.
Here are my questions:
This should cover #1, as long as you're using pyspark:
# Configure Spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY-SECRET-ACCESS-KEY")

# Retrieve the data
my_data = sc.textFile("s3n://my-bucket-name/my-key")
my_data.count()   # Count all rows
my_data.take(20)  # Take the first 20 rows

# Parse it
import json
my_data.map(lambda x: json.loads(x)).take(20)  # Take the first 20 rows of JSON-parsed content
Note that the S3 address is s3n://, not s3://. This is a legacy thing from Hadoop.
Also, my-key can point to a whole S3 directory*. If you're using a Spark cluster, importing several medium-sized files is usually faster than importing a single big one (see the sketch below).
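As a minimal sketch of that, assuming a hypothetical bucket layout where daily JSON files sit under a logs/ prefix (the names are made up; sc.textFile accepts both directory paths and Hadoop-style globs):

# Read every file under an S3 prefix (the "directory") -- paths here are hypothetical
all_days = sc.textFile("s3n://my-bucket-name/logs/")

# Globs work too, e.g. only one month's files
january = sc.textFile("s3n://my-bucket-name/logs/2015-01-*.json")
january.count()  # the matching files are read in parallel across the cluster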
For #2 and #3, I'd suggest looking up Spark's Parquet support. You can also save text back to S3:
my_data.map(lambda x: json.dumps(x)).saveAsTextFile('s3n://my-bucket-name/my-new-key')
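If you go the Parquet route, here's a minimal sketch, assuming Spark 1.4+ (for the DataFrame reader/writer API); the paths are made up:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Infer a schema from the JSON lines and get a DataFrame
df = sqlContext.read.json("s3n://my-bucket-name/my-key")

# Write the intermediate result as Parquet: columnar, compressed,
# and much cheaper to re-read than raw JSON text
df.write.parquet("s3n://my-bucket-name/my-parquet-key")

# A later stage can load it back without re-parsing any JSON
df2 = sqlContext.read.parquet("s3n://my-bucket-name/my-parquet-key")

Parquet stores the schema alongside the data and only reads the columns you ask for, which is what usually makes it a good format for intermediate results.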
Not knowing the size of your dataset and the computational complexity of your pipeline, I can't say which way of storing intermediate data to S3 will be the best use of your resources.
*S3 doesn't really have directories, but you know what I mean.