
Parsing files from Amazon S3 with Apache Spark

I am using Apache Spark and I have to parse files from Amazon S3. How would I know the file extension while fetching the files from an Amazon S3 path?

I suggest following the Cloudera tutorial Accessing Data Stored in Amazon S3 through Spark.

To access data stored in Amazon S3 from Spark applications, you could use the Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs, providing URLs of the form s3a://bucket_name/path/to/file.txt.
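As a minimal sketch of that approach (the bucket name and path are placeholders, and credentials are assumed to be configured via the fs.s3a.* Hadoop properties or the environment):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3ReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("S3ReadExample")
    val sc = new SparkContext(conf)

    // textFile is backed by the Hadoop file APIs, so an s3a:// URL
    // works the same way as any other Hadoop-compatible path.
    val lines = sc.textFile("s3a://bucket_name/path/to/file.txt")
    println(s"Line count: ${lines.count()}")

    sc.stop()
  }
}
```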

You can read and write Spark SQL DataFrames using the Data Source API.
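A short sketch of that path, with placeholder bucket and object names; note that the format comes from the reader/writer method you call, not from the file extension:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("S3DataFrames").getOrCreate()

// Read a CSV object from S3 into a DataFrame via the Data Source API.
val df = spark.read
  .option("header", "true")
  .csv("s3a://bucket_name/path/to/data.csv")

// Write the result back to S3, this time as Parquet.
df.write.parquet("s3a://bucket_name/path/to/output")
```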

Regarding the file extension, there are a few solutions. You could simply take the extension from the filename (e.g. file.txt).
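As a sketch, sc.wholeTextFiles exposes each file's full path alongside its content, so the extension can be derived per file; the extension helper below is purely illustrative, and the bucket/prefix is a placeholder:

```scala
// Hypothetical helper: derive the extension from a path, if any.
def extension(path: String): Option[String] = {
  val name = path.substring(path.lastIndexOf('/') + 1)
  val dot  = name.lastIndexOf('.')
  if (dot > 0) Some(name.substring(dot + 1)) else None
}

// wholeTextFiles yields (path, content) pairs, one per file.
val files = sc.wholeTextFiles("s3a://bucket_name/path/to/*")
files.keys.collect().foreach { path =>
  println(s"$path -> ${extension(path).getOrElse("no extension")}")
}
```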

If the extensions were removed from the files stored in your S3 buckets, you could still determine the content type by looking at the metadata added to each S3 resource.

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html
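For instance, with the AWS SDK for Java (v1), getObjectMetadata issues the HEAD Object request described at the link above and exposes the stored Content-Type; the bucket and key below are placeholders, and credentials are assumed to come from the default provider chain:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder

val s3 = AmazonS3ClientBuilder.defaultClient()

// HEAD Object: fetches the metadata only, without downloading the content.
val metadata = s3.getObjectMetadata("bucket_name", "path/to/file")
println(s"Content-Type: ${metadata.getContentType}")
```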
