How to convert a csv file in an S3 bucket to an RDD
I'm pretty new to this topic, so any help will be much appreciated.
I'm trying to read a csv file stored in an S3 bucket and convert its data to an RDD, so I can work with it directly without creating a file locally.
So far I've been able to load the file using AmazonS3ClientBuilder, but all I get is the file content in an S3ObjectInputStream, and I'm not able to work with its content.
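import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3ClientBuilder

val bucketName = "bucket-name"
// Static credentials (placeholder values)
val credentials = new BasicAWSCredentials("accessKey", "secretKey")
val s3client = AmazonS3ClientBuilder
  .standard()
  .withCredentials(new AWSStaticCredentialsProvider(credentials))
  .withRegion(Regions.US_EAST_2)
  .build()
// Fetch the object and expose its content as an S3ObjectInputStream
val s3object = s3client.getObject(bucketName, "file-name.csv")
val inputStream = s3object.getObjectContent()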
....
I have also tried wrapping it in a BufferedSource, but once that's done I don't know how to convert it to a DataFrame or an RDD.
import scala.io.Source

// Wrap the S3 stream in a BufferedSource
val myData = Source.fromInputStream(inputStream)
....
You can do it with the S3A file system provided in the Hadoop-AWS module:
Either add
<property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property>
to core-site.xml, or add
.config("fs.s3.impl", classOf[S3AFileSystem].getName)
to the SparkSession builder.
Then access S3 using
spark.read.csv("s3://bucket/key")
If you want the RDD that was asked for:
spark.read.csv("s3://bucket/key").rdd
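Putting the pieces of this answer together, here is a minimal sketch of what a complete program might look like. It rests on several assumptions that are not in the answer: the hadoop-aws jar and a matching AWS SDK are on the classpath, credentials are passed as static keys through the S3A properties fs.s3a.access.key and fs.s3a.secret.key, the job runs with a local master, and the csv file has a header row. The object name S3CsvToRdd and the helpers buildSession and csvToRdd are hypothetical names chosen for illustration; the bucket and file names are the placeholders from the question.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object S3CsvToRdd {

  // Build a session that maps the s3:// scheme to S3A, as the answer suggests.
  // Assumption: static keys are supplied through the S3A properties.
  def buildSession(accessKey: String, secretKey: String): SparkSession =
    SparkSession.builder()
      .appName("s3-csv-to-rdd")   // hypothetical application name
      .master("local[*]")         // assumption: local run; drop when using spark-submit
      // Same value as classOf[S3AFileSystem].getName in the answer
      .config("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      .config("spark.hadoop.fs.s3a.access.key", accessKey)
      .config("spark.hadoop.fs.s3a.secret.key", secretKey)
      .getOrCreate()

  // Read a csv file into a DataFrame and expose it as an RDD of rows.
  // Assumption: the first line of the file is a header.
  def csvToRdd(spark: SparkSession, path: String): RDD[Row] =
    spark.read.option("header", "true").csv(path).rdd

  def main(args: Array[String]): Unit = {
    val spark = buildSession("accessKey", "secretKey") // placeholder credentials
    val rdd = csvToRdd(spark, "s3://bucket-name/file-name.csv")
    rdd.take(5).foreach(println)                       // print a few rows as a smoke test
    spark.stop()
  }
}
```

The read is kept in its own function so it can be tried against a local file path before pointing it at S3. Hard-coded keys are only for illustration; a credentials provider or environment-based configuration is usually a better fit outside a quick experiment.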
In the end I was able to get the result I was looking for; take a look at https://gist.github.com/snowindy/d438cb5256f9331f5eec
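For readers who would rather keep the AWS SDK stream from the question than switch to S3A, one possible approach is sketched below. It is not taken from the linked gist. It assumes the file is small enough to fit in driver memory and is UTF-8 encoded, since every line is read on the driver before being distributed. The names StreamToRdd and linesToRdd are hypothetical.

```scala
import java.io.InputStream

import scala.io.Source

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object StreamToRdd {

  // Read all lines from the stream on the driver, then distribute them as an RDD.
  // Assumption: the content fits in driver memory and is UTF-8 encoded.
  def linesToRdd(sc: SparkContext, in: InputStream): RDD[String] = {
    val source = Source.fromInputStream(in, "UTF-8")
    try {
      // Materialize the lazy iterator before the source is closed
      val lines = source.getLines().toList
      sc.parallelize(lines)
    } finally {
      source.close()
    }
  }
}
```

With the values from the question this could be called as StreamToRdd.linesToRdd(spark.sparkContext, s3object.getObjectContent()). Each element is a raw csv line; splitting on "," is only safe when no field contains quoted commas, so for larger or more complex files the S3A route above is likely the better choice.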