How to read only 5 records from an S3 bucket and return them without fetching all the data of the CSV file

Question

Hello everyone. I know there are lots of similar questions here, but I have code that executes properly and returns five records. My question is: how can I avoid reading the entire file and still return only the desired rows? Suppose I have a CSV file that is gigabytes in size; I don't want to fetch the whole file just to get 5 records. Please tell me how I should do this, and if possible explain whether my code is bad and why. Code:

import boto3
from botocore.client import Config
import pandas as pd

ACCESS_KEY_ID = 'something'
ACCESS_SECRET_KEY = 'something'
BUCKET_NAME = 'something'
Filename='dataRepository/source/MergedSeedData(Parts_skills_Durations).csv'

client = boto3.client("s3",
                     aws_access_key_id=ACCESS_KEY_ID,
                     aws_secret_access_key=ACCESS_SECRET_KEY)
obj = client.get_object(Bucket=BUCKET_NAME, Key=Filename)
Data = pd.read_csv(obj['Body'])
# data1 = Data.columns
# return data1
Data=Data.head(5)
print(Data)

This is my code. It runs fine and gets the 5 records from the S3 bucket, but as explained above, I am looking for a way to do it without fetching the whole file. If you have any other questions, feel free to message me. Thanks in advance.

Answer 1

You can use pandas' ability to read a file in chunks, loading only as much data as you need.

data_iter = pd.read_csv(obj['Body'], chunksize = 5)
data = data_iter.get_chunk()
print(data)
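As a minimal sketch of the chunked approach above, the helpers below take any binary file-like object, so the same call should work with obj['Body'] from the question. The function names and the in-memory sample data are illustrative assumptions, not part of the original answer. The second helper uses the nrows parameter of read_csv, which is an alternative when only the first rows are ever needed. How many bytes are actually pulled from S3 depends on the parser's internal buffering, so expect somewhat more than five rows' worth of data to be transferred, but not the whole object.

```python
import io

import pandas as pd


def first_chunk(body, n=5):
    """Return the first n data rows using a chunked reader."""
    # chunksize makes read_csv return an iterator instead of a DataFrame.
    reader = pd.read_csv(body, chunksize=n)
    # get_chunk() parses only the next n rows from the stream.
    return reader.get_chunk()


def first_rows(body, n=5):
    """Return the first n data rows using nrows (header line not counted)."""
    return pd.read_csv(body, nrows=n)


# Stand-in for obj['Body']: an in-memory CSV with 1000 data rows.
sample = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(1000))
raw = sample.encode("utf-8")

chunk_df = first_chunk(io.BytesIO(raw))
nrows_df = first_rows(io.BytesIO(raw))
print(chunk_df)
```

With the question's code, this would be called as first_chunk(obj['Body']) in place of pd.read_csv(obj['Body']) followed by head(5).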

Answer 2

You can use the HTTP Range header (see RFC 2616), which takes a byte-range argument. The S3 APIs support this, and it lets you avoid reading or downloading the whole S3 file.

Sample code:

import boto3
obj = boto3.resource('s3').Object('bucket101', 'my.csv')
record_stream = obj.get(Range='bytes=0-1000')['Body']
print(record_stream.read())

This will return only the bytes in the range given in the header. Both ends of the range are inclusive, so bytes=0-1000 returns the first 1001 bytes.

But you will need to modify this to convert the returned string into a DataFrame, perhaps by reading the body and then handling the \t and \n characters present in the string that comes from the .csv file.
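One possible way to do that conversion is sketched below. It assumes the bytes come from a range request like the one above, that the range starts at byte 0 so the header line is included, and that no quoted field contains an embedded newline. The function name and the sample bytes are hypothetical. Because a byte range usually ends in the middle of a line, the sketch drops the trailing partial line before parsing.

```python
import io

import pandas as pd


def frame_from_byte_range(raw, n=5):
    """Parse up to n complete rows from the leading bytes of a CSV file."""
    # Keep only the bytes up to and including the last newline, so a row
    # that was cut off by the end of the range is discarded.
    # If the range contains no newline at all, this leaves an empty buffer
    # and read_csv raises EmptyDataError; request a larger range then.
    complete = raw[: raw.rfind(b"\n") + 1]
    return pd.read_csv(io.BytesIO(complete), nrows=n)


# Stand-in for record_stream.read(): 8 rows, cut off inside the 6th row.
sample = b"id,name\n" + b"".join(
    f"{i},{c}\n".encode("utf-8") for i, c in enumerate("abcdefgh", start=1)
)
raw = sample[:30]  # ends with the partial row b"6,"

df = frame_from_byte_range(raw)
print(df)
```

With the sample code above, raw would be record_stream.read(). The range has to be large enough to cover the header and the rows you want; if fewer than n complete rows fit, the function returns only those that do.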

Answer 1: score 3, accepted, 2019-03-28 12:04:27

Answer 2: score 1, 2019-03-28 12:19:17
