Splitting a large JSON file into smaller JSON files on AWS Lambda and saving them to S3

Question

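Assume this is a large file.

I want to split the file into multiple chunks. For example, if I have a 500 MB JSON file, I want to split it into separate chunks. The maximum acceptable file size is 30 MB (30,000,000 bytes). The function runs on AWS Lambda, and the results should be saved to an S3 bucket. How can I do this?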

{
   "start":"HelloI",
   "users": [
  {
    "id": 1,
    "name": "Leanne Graham",
    "username": "Bret",
    "address": {
      "street": "Kulas Light",
      "suite": "Apt. 556",
      "city": "Gwenborough"
    }
  },
  {
   ...
  }
]
}

Here is my code. I believe I am doing something wrong. Any help would be greatly appreciated. Thanks.

json_size = 50580490;
MIN_SIZE = 30000000;
data_len = len(file)

get_array_length = len(file["users"])

print("Print data len : ",data_len)
print("Print  Get Array length  : ", get_array_length)

items = []
if isinstance(file, dict):
  print('Valid JSON file found')

  # determine number of files necessary
  split_into_files = math.ceil(json_size/MIN_SIZE)
  print(f'File will be split into {split_into_files} equal parts')

  split_data = [[] for i in range(0,split_into_files)]
  print('split_data : ', split_data)

  starts = [math.floor(i * get_array_length/split_into_files) for i in range(0,split_into_files)]
  starts.append(data_len)
  print('starts : ', starts)

  for i in range(0,split_into_files):
    for n in range(starts[i], starts[i+1]):
      print('The value for N is: ' , n)     
      print("split_data[i] :" , split_data[i])
      #print(file["users"][n])
      split_data[i].append(file["users"][n])
      print(split_data[i])

Answer 1

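It looks like you are splitting the data in its raw form. JSON is a hierarchical structure, so splitting the data directly does not respect record boundaries and can break the structure.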
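To illustrate the point, here is a minimal sketch that compares cutting the serialized text at an arbitrary offset with cutting the parsed list of records. The sample document and the helper `is_valid_json` are hypothetical and only mimic the shape of the file in the question.

```python
import json

# Hypothetical sample shaped like the document in the question.
document = {
    "start": "HelloI",
    "users": [{"id": 1, "username": "Bret"}, {"id": 2, "username": "example"}],
}
raw = json.dumps(document)


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Cutting the serialized text at an arbitrary offset ignores record boundaries.
half = len(raw) // 2
raw_pieces = [raw[:half], raw[half:]]

# Cutting the parsed "users" list keeps every record intact.
users = document["users"]
record_pieces = [
    json.dumps({"users": users[:1]}),
    json.dumps({"users": users[1:]}),
]

print([is_valid_json(piece) for piece in raw_pieces])
print([is_valid_json(piece) for piece in record_pieces])
```

Neither raw piece parses on its own, while both record-based pieces do.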

You can first read the users element into another structure, such as a list or a DataFrame.

import json

# Load the whole document, then keep only the list of user records.
with open('users.json', 'r') as f:
    user_list = json.load(f)
    users_data = user_list['users']

(You need to read starting from the users list in the JSON file, because the file contains another key, such as "start".)

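You will then have all the records in users_data, and you can split them according to the number of JSON records. If you want better performance for later use, you can sort the records in users_data before splitting them into separate JSON files.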
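Splitting by record count could look roughly like the sketch below. The function name `split_records`, the parameter `records_per_file`, and the sample records are assumptions made for illustration; in practice `users_data` would come from the loaded file.

```python
import json


def split_records(records, records_per_file):
    """Yield consecutive slices holding at most records_per_file records each."""
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    for start in range(0, len(records), records_per_file):
        yield records[start:start + records_per_file]


# Hypothetical stand-in for the users_data list loaded from users.json.
users_data = [{"id": n, "username": f"user{n}"} for n in range(1, 8)]

# Each part is wrapped in a "users" key so it mirrors the original layout.
parts = [json.dumps({"users": chunk}) for chunk in split_records(users_data, 3)]

for number, part in enumerate(parts, start=1):
    print(f"part {number}: {len(json.loads(part)['users'])} records")
```

A fixed record count is simple, but it only bounds the file size if the records are roughly uniform in size.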

Answer 2

My solution may not be the best one, but it works.

import json
import boto3
import os
import time
import math


# Variable definition.
SESSION_STORAGE = os.environ['JSON_BUCKET']
SESSION = boto3.session.Session()
CURRENT_REGION = SESSION.region_name
S3_CLIENT = boto3.client("s3")
MIN_SIZE = 16000000  # 16 MB; despite the name, this is the target maximum size of each part

def handler(event, context):
    
    # Instantiate start time.
    start_time = time.time()
    
    # Bucket Name where file was uploaded
    #bucket = event['Records'][0]['s3']['bucket']['name']
    bucket = 'json-upload-bucket' # => testing bucket.
    
    #file_key_name = event['Records'][0]['s3']['object']['key'] # Use this to make it dynamic.
    file_key_name = 'XXXXXXXXX-users.json' # This is for testing only. 
    print("File Key Name", file_key_name)
    
    response = S3_CLIENT.get_object(Bucket=bucket, Key=file_key_name)
    #print("Response : ", response)
    
    # json size 
    json_size = response['ContentLength']
    print("json_size : ", json_size)
    
    # Reading content
    content = response['Body']
    jsonObject = json.loads(content.read())
    data = jsonObject['users']
    # The record count is computed below, once 'users' is confirmed to be a list.
    
    if isinstance(data, list):
        data_len = len(data)
        print('Valid JSON file found')
        
        if json_size <= MIN_SIZE:
            print('File is already within the size limit; no split needed.')
        else:
            # determine number of files necessary
            split_into_files = math.ceil(json_size/MIN_SIZE)
            print(f'File will be split into {split_into_files} equal parts')
            
            # initialize 2D array
            split_data = [[] for i in range(0,split_into_files)]
            
            # determine indices of cutoffs in array
            starts = [math.floor(i * data_len/split_into_files) for i in range(0,split_into_files)]
            starts.append(data_len)
            
            # loop through 2D array
            for i in range(0,split_into_files):
                # loop through each range in array
                for n in range(starts[i],starts[i+1]):
                    split_data[i].append(data[n])
                
                
                print(file_key_name.split('.')[0] + '_' + str(i+1) + '.json')
                name = os.path.basename(file_key_name).split('.')[0] + '_' + str(i+1) + '.json'
                print('Name : ', name)
                folder = '/tmp/'+name
                with open(folder, 'w') as outfile:
                    
                    # Rebuild the original structure: keep the other top-level
                    # keys and replace 'users' with this part's records.
                    generated_json = {**jsonObject, 'users': split_data[i]}
                    json.dump(generated_json, outfile, indent=4)
                    
                S3_CLIENT.upload_file(folder, bucket, name)
                    
                print('Part',str(i+1),'... completed')
            
    else:
        print("JSON is not an Array of Objects")

    return {
        'statusCode': 200,
        'body': json.dumps('JSON split completed; check S3.')
    }
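
The handler above sizes its parts by record count, so a part can still exceed the limit when records vary in size, and writing with indent=4 adds whitespace to every part. One possible alternative, offered as a sketch and not as part of the original answer, is to pack records greedily by their encoded size. The names `split_by_size` and `max_bytes` are hypothetical. The limit here covers only the compact JSON array of records, so the bytes taken by the surrounding object (for example the "start" key) would have to be subtracted by the caller.

```python
import json


def split_by_size(records, max_bytes):
    """Greedily group records so each group's compact JSON array fits in max_bytes."""
    chunks = []
    current = []
    current_size = 2  # bytes for the enclosing "[" and "]"
    for record in records:
        encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")
        # Every record after the first in a chunk needs a separating comma.
        added = len(encoded) + (1 if current else 0)
        if current and current_size + added > max_bytes:
            # The record does not fit: close this chunk and start a new one.
            chunks.append(current)
            current = []
            current_size = 2
            added = len(encoded)
        if current_size + added > max_bytes:
            raise ValueError("a single record exceeds max_bytes")
        current.append(record)
        current_size += added
    if current:
        chunks.append(current)
    return chunks


# Hypothetical records of similar size, used only to exercise the function.
records = [{"id": n, "name": "x" * 50} for n in range(20)]
chunks = split_by_size(records, 200)
sizes = [len(json.dumps(c, separators=(",", ":")).encode("utf-8")) for c in chunks]
print(len(chunks), max(sizes))
```

Each chunk would need to be written with the same compact separators for the size accounting to hold; pretty-printed output is larger than what the function measures.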
