
Persisting RDD on Amazon S3

I have a large text file containing JSON objects on Amazon S3. I am planning to process this data using Spark on Amazon EMR.

Here are my questions:

  1. How do I load the text file containing JSON objects into Spark?
  2. Is it possible to persist the internal RDD representation of this data on S3 after the EMR cluster is turned off?
  3. If I am able to persist the RDD representation, is it possible to load the data directly in RDD format the next time I need to analyze the same data?

This should cover #1, as long as you're using pyspark:

import json

# `sc` is the SparkContext that the pyspark shell provides.
# Configure Spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY-SECRET-ACCESS-KEY")

# Retrieve the data (one RDD element per line of text)
my_data = sc.textFile("s3n://my-bucket-name/my-key")
my_data.count()   # Count all rows
my_data.take(20)  # Take the first 20 rows

# Parse it
my_data.map(json.loads).take(20)  # Take the first 20 rows of JSON-parsed content
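
The parsing step above assumes that every line is a complete, valid JSON object. If the file might contain blank or malformed lines, a tolerant parser along these lines could be used instead. This is only a sketch: `parse_json_line` is a hypothetical helper name, and silently dropping bad lines is an assumption about what you want.

```python
import json

def parse_json_line(line):
    """Return [obj] for a valid JSON line, or [] for a blank or malformed one."""
    line = line.strip()
    if not line:
        return []
    try:
        return [json.loads(line)]
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []

# In a pyspark session where `my_data` is the RDD of raw lines:
# parsed = my_data.flatMap(parse_json_line)
# parsed.take(20)
```

Returning a list lets `flatMap` drop the bad lines without a separate `filter` step.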

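Note that the S3 address here uses s3n://, not s3://. This is a legacy thing from Hadoop, where s3:// referred to an older block-based filesystem. (On EMR, s3:// is handled by EMRFS, and newer Hadoop versions use s3a:// in place of s3n://.)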

Also, my-key can point to a whole S3 directory*. If you're using a Spark cluster, importing several medium-sized files is usually faster than importing a single big one.
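
As a rough illustration of pointing at a "directory", the sketch below builds the address for a key prefix and hands it to `textFile`, which reads every file under it. The helper names `s3_uri` and `load_prefix` and the example prefix are hypothetical, and the `s3n` default simply mirrors the scheme used above.

```python
def s3_uri(bucket, prefix, scheme="s3n"):
    """Build an address such as s3n://my-bucket-name/logs/2015/."""
    return "{}://{}/{}".format(scheme, bucket, prefix.lstrip("/"))

def load_prefix(sc, bucket, prefix, scheme="s3n"):
    """Load all files under a key prefix into a single RDD of lines."""
    return sc.textFile(s3_uri(bucket, prefix, scheme))

# In a pyspark session (`sc` is the shell's SparkContext):
# my_data = load_prefix(sc, "my-bucket-name", "logs/2015/")
```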

For #2 and #3, I'd suggest looking into Spark's Parquet support. You can also save text back to S3:

parsed_data = my_data.map(json.loads)  # my_data holds raw strings, so parse before re-serializing
parsed_data.map(json.dumps).saveAsTextFile('s3n://my-bucket-name/my-new-key')
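
For the Parquet route, here is a minimal sketch of saving the data and reloading it in a later session. A cached RDD does not outlive the cluster, so "persisting" here means writing the data out as files and reading them back. The sketch assumes a Spark version that provides `SparkSession` (2.0 or later); the function names and the paths in the usage comments are hypothetical.

```python
def json_to_parquet(spark, json_path, parquet_path):
    """Read line-delimited JSON and write it out as Parquet."""
    df = spark.read.json(json_path)  # the schema is inferred from the data
    df.write.parquet(parquet_path)   # default save mode: fails if the path already exists

def load_parquet(spark, parquet_path):
    """Reload the saved data in a later session, possibly on a new cluster."""
    return spark.read.parquet(parquet_path)

# Usage (hypothetical paths):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# json_to_parquet(spark, "s3n://my-bucket-name/my-key", "s3n://my-bucket-name/my-parquet-key")
# df = load_parquet(spark, "s3n://my-bucket-name/my-parquet-key")
# rdd = df.rdd  # an RDD of Row objects, if you need the RDD API
```

Parquet stores the schema alongside the data, so the reload step does not need to re-parse JSON.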

Without knowing the size of your dataset and the computational complexity of your pipeline, I can't say which way of storing intermediate data on S3 will make the best use of your resources.

*S3 doesn't really have directories, but you know what I mean.
