
Reading a csv.gz file from SageMaker using PySpark kernel mode


I am trying to read a compressed CSV file in PySpark, but I am unable to read it in PySpark kernel mode in SageMaker.

I can read the same file using pandas when the kernel is conda-python3 (in SageMaker).

What I tried:

file1 = 's3://testdata/output1.csv.gz'
file1_df = spark.read.csv(file1, sep='\t')

Error message:

An error was encountered:
An error occurred while calling 104.csv.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 7FF77313; S3 Extended Request ID: 

Kindly let me know if I am missing anything.


There are other Hadoop connectors to S3, but only S3A is actively maintained by the Hadoop project itself. Apache Hadoop's original s3:// client is no longer included in Hadoop, and its s3n:// filesystem client is no longer available either: users must migrate to the newer s3a://.
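Since the legacy s3:// and s3n:// clients are gone, switching the path to the s3a:// scheme is worth trying. A minimal sketch, assuming the bucket and key from the question, that the Hadoop S3A connector is on the cluster's classpath, and that the notebook's execution role has s3:GetObject permission on the object (the 403 suggests checking this as well):

file1 = 's3a://testdata/output1.csv.gz'  # s3a:// instead of the legacy s3:// scheme

# Spark decompresses .gz input transparently based on the file extension
file1_df = spark.read.csv(file1, sep='\t')
file1_df.show(5)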

I have attached a document for your reference: Apache S3 Connectors.

PySpark reads gz files automatically, as per the document they have provided. Click Spark Programming Guide for the document.

file1 = 's3://testdata/output1.csv.gz'
# textFile handles gzip-compressed input transparently; each element is one line of text
rdd = sc.textFile(file1)
rdd.take(10)  # preview the first 10 lines

To load the file into a DataFrame:

df = spark.read.csv(file1) 
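If the file is tab-separated with a header row (an assumption here; the question passed sep='\t'), the standard read options can be supplied explicitly:

# header/inferSchema are assumptions about the file's layout
df = spark.read.csv(file1, sep='\t', header=True, inferSchema=True)
df.printSchema()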
