Encoding issue using pd.read_csv() to read csv file in S3 location

Problem: I get an encoding error when I use pd.read_csv() to read a CSV file stored in an S3 location.

Below is my code:

 import boto3
 import pandas as pd

 # parameters
 s3_bucket = 'my_bucket'
 s3_key = 'my_key'

 # create s3 client
 s3_client = boto3.client('s3')

 # create s3 object
 obj = s3_client.get_object(Bucket=s3_bucket, Key=s3_key)
   
 # read csv file from s3
 df = pd.read_csv(obj['Body'], encoding='cp1252')

But this is the error I get: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I don't understand why I get a 'utf-8' decoding error when I explicitly set the encoding to 'cp1252'. By the way, 'cp1252' is the encoding I found for my csv file.

I looked into the boto3 documentation: the get_object() method returns the object body as a StreamingBody. According to the pandas documentation, read_csv() accepts a path, a file, a buffer and so on as input.

Therefore, I think you have to convert the object body first. This can be done with Python's io module. The following code should fix your problem:

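import io
obj = s3_client.get_object(Bucket=s3_bucket, Key=s3_key)
# read the whole body into memory; the encoding from the question is passed explicitly
df = pd.read_csv(io.BytesIO(obj['Body'].read()), encoding='cp1252')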

Explanation: the pandas documentation states:

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

This is fulfilled by reading the bytes of the StreamingBody and wrapping them in an io.BytesIO buffer, which pandas can then read like a regular file.
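If you want to try the idea without touching S3, here is a minimal sketch. FakeBody and read_csv_from_body are hypothetical names made up for this illustration (they are not part of boto3 or pandas); FakeBody only mimics the read() method of the object found under obj['Body'], and the sample data is invented.

```python
import io

import pandas as pd


class FakeBody:
    """Hypothetical stand-in for obj['Body']; it only offers read()."""

    def __init__(self, payload: bytes):
        self._buffer = io.BytesIO(payload)

    def read(self, amt=None):
        # Return up to `amt` bytes, or everything that is left if amt is None.
        return self._buffer.read(amt)


def read_csv_from_body(body, encoding="cp1252"):
    """Read all bytes from `body` and parse them as CSV with `encoding`."""
    raw = body.read()  # bytes, not str
    return pd.read_csv(io.BytesIO(raw), encoding=encoding)


# Invented sample data: 0xff is "ÿ" in cp1252, but that byte never occurs
# in valid UTF-8, so decoding this payload as UTF-8 would fail.
payload = "name,city\nÿves,Zürich\n".encode("cp1252")
df = read_csv_from_body(FakeBody(payload))
print(df)
```

With a real bucket you would pass obj['Body'] instead of FakeBody(payload). Note that this approach loads the whole object into memory, which is fine for small files but may not suit very large ones.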
