Splitting a big JSON file into smaller JSON files on AWS Lambda and saving them to S3
Assume the file below is a large file.
I want to split the file into multiple chunks. For example, if I have a 500 MB JSON file, I want to break it into smaller chunks, where the maximum acceptable file size is 30 MB (30000000 bytes). This function runs on AWS Lambda, and the results should be saved to an S3 bucket. How can I do this?
{
  "start": "HelloI",
  "users": [
    {
      "id": 1,
      "name": "Leanne Graham",
      "username": "Bret",
      "address": {
        "street": "Kulas Light",
        "suite": "Apt. 556",
        "city": "Gwenborough"
      }
    },
    {
      ...
    }
  ]
}
Here is my code. I believe I am doing something wrong. Any help would be greatly appreciated, thanks.
import math

json_size = 50580490
MIN_SIZE = 30000000

data_len = len(file)
get_array_length = len(file["users"])
print("Print data len : ", data_len)
print("Print Get Array length : ", get_array_length)

items = []

if isinstance(file, dict):
    print('Valid JSON file found')
    # determine number of files necessary
    split_into_files = math.ceil(json_size / MIN_SIZE)
    print(f'File will be split into {split_into_files} equal parts')
    split_data = [[] for i in range(0, split_into_files)]
    print('split_data : ', split_data)
    starts = [math.floor(i * get_array_length / split_into_files) for i in range(0, split_into_files)]
    starts.append(data_len)
    print('starts : ', starts)
    for i in range(0, split_into_files):
        for n in range(starts[i], starts[i+1]):
            print('The value for N is: ', n)
            print("split_data[i] :", split_data[i])
            #print(file["users"][n])
            split_data[i].append(file["users"][n])
            print(split_data[i])
It looks like you are splitting the data in its raw form. JSON is a hierarchical structure, so when you split the data directly it will not recognize record boundaries and may break the structure.
You can first read the user elements into another structure such as a list or a DataFrame.
import json

with open('users.json', 'r') as f:
    user_list = json.load(f)
users_data = user_list['users']
(You need to start reading from the user list inside the JSON file, because the file contains another field besides it, such as "start".)
Then you will have all the records in users_data, and you can split them according to the number of records. If you want to add some performance for future use, you can sort the records in users_data and then split them into separate JSON files.
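The split step described above can be sketched like this, a minimal example where the part count and the sample data are illustrative, not taken from the original file:

```python
import json
import math

# Illustrative stand-in for the parsed users.json document.
user_list = {
    "start": "HelloI",
    "users": [{"id": i, "name": f"User {i}"} for i in range(10)],
}
users_data = user_list["users"]

# Split the records into a chosen number of roughly equal parts.
num_parts = 3
chunk_size = math.ceil(len(users_data) / num_parts)
chunks = [users_data[i:i + chunk_size]
          for i in range(0, len(users_data), chunk_size)]

# Rebuild each part with the non-array field ("start") preserved.
parts = [{"start": user_list["start"], "users": chunk} for chunk in chunks]
for idx, part in enumerate(parts, start=1):
    print(f"part {idx}: {len(part['users'])} users,",
          len(json.dumps(part)), "bytes")
```

Each element of parts is a complete, valid JSON document that can be dumped to its own file.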
My solution may not be the best one, but it works.
import json
import boto3
import os
import time
import math

# Variable definition.
SESSION_STORAGE = os.environ['JSON_BUCKET']
SESSION = boto3.session.Session()
CURRENT_REGION = SESSION.region_name
S3_CLIENT = boto3.client("s3")
MIN_SIZE = 16000000  # 16 MB

def handler(event, context):
    # Instantiate start time.
    start_time = time.time()

    # Bucket name where the file was uploaded.
    #bucket = event['Records'][0]['s3']['bucket']['name']
    bucket = 'json-upload-bucket'  # => testing bucket.
    #file_key_name = event['Records'][0]['s3']['object']['key']  # Use this to make it dynamic.
    file_key_name = 'XXXXXXXXX-users.json'  # This is for testing only.
    print("File Key Name", file_key_name)

    response = S3_CLIENT.get_object(Bucket=bucket, Key=file_key_name)
    #print("Response : ", response)

    # JSON size.
    json_size = response['ContentLength']
    print("json_size : ", json_size)

    # Reading content.
    content = response['Body']
    jsonObject = json.loads(content.read())
    data = jsonObject['users']
    data_len = len(jsonObject)
    #print('Length of JSON : ', data_len)
    #print('Order array length : ', len(data))

    if isinstance(data, list):
        data_len = len(data)
        print('Valid JSON file found')
        if json_size <= MIN_SIZE:
            print('File meets the minimum size.')
        else:
            # Determine number of files necessary.
            split_into_files = math.ceil(json_size / MIN_SIZE)
            print(f'File will be split into {split_into_files} equal parts')
            # Initialize 2D array.
            split_data = [[] for i in range(0, split_into_files)]
            # Determine indices of cutoffs in array.
            starts = [math.floor(i * data_len / split_into_files) for i in range(0, split_into_files)]
            starts.append(data_len)

            # Loop through 2D array.
            for i in range(0, split_into_files):
                # Loop through each range in array.
                for n in range(starts[i], starts[i+1]):
                    split_data[i].append(data[n])

                print(file_key_name.split('.')[0] + '_' + str(i+1) + '.json')
                name = os.path.basename(file_key_name).split('.')[0] + '_' + str(i+1) + '.json'
                print('Name : ', name)
                folder = '/tmp/' + name
                with open(folder, 'w') as outfile:
                    # Restructure the JSON back to its original state.
                    generated_json = {
                        list(jsonObject.keys())[0]: list(jsonObject.values())[0],
                        list(jsonObject.keys())[1]: split_data[i]
                    }
                    json.dump(generated_json, outfile, indent=4)
                S3_CLIENT.upload_file(folder, bucket, name)
                print('Part', str(i+1), '... completed')
    else:
        print("JSON is not an Array of Objects")

    return {
        'statusCode': 200,
        'body': json.dumps('JSON split completed checks s3.')
    }