
Import a very large txt file to DynamoDB

I have a huge txt file and I need to put it into DynamoDB. The file structure is:

223344|blue and orange|Red|16/12/2022

223344|blue and orange|Red|16/12/2022...

This file has more than 200M lines.

I have tried to convert it to a JSON file using the code below:

import json

filename = 'smini_final_data.json'

# Start with an empty JSON array so the file can be read on the first iteration
with open(filename, 'w') as file:
    json.dump([], file)

with open('mini_data.txt', 'r') as f_in:
    for line in f_in:
        fields = line.strip().split('|')
        result = {"fild1": fields[0], "fild2": fields[1],
                  "fild3": fields[2].replace(" ", ""), "fild4": fields[3]}
        # The whole JSON file is re-read and rewritten for every input line
        with open(filename, "r") as file:
            data = json.load(file)
        data.append(result)
        with open(filename, "w") as file:
            json.dump(data, file)

But this isn't efficient, and it's only the first part of the job (converting the data to JSON); after that I still need to put the JSON into DynamoDB.

I have used this code for the insert (it looks good):

    def insert(self):
        if not self.dynamodb:
            self.dynamodb = boto3.resource(
                'dynamodb', endpoint_url="http://localhost:8000")
        table = self.dynamodb.Table('fruits')

        # Load the converted JSON file into memory
        with open("final_data.json") as json_file:
            orange = json.load(json_file, parse_float=decimal.Decimal)

        with table.batch_writer() as batch:
            for fruit in orange:
                batch.put_item(
                    Item={
                        'fild1': fruit['fild1'],
                        'fild2': fruit['fild2'],
                        'fild3': fruit['fild3'],
                        'fild4': fruit['fild4']
                    }
                )

So, does anyone have suggestions for processing this txt file more efficiently?

Thanks

The step of converting from delimited text to JSON seems unnecessary in this case. The way you've written it requires reopening and rewriting the JSON file for each line of your delimited text file. That I/O overhead repeated 200M times can really slow things down.
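(If you did still want an intermediate file, a more scalable format would be JSON Lines: one object per line, appended as you go, so the file is never re-read or rewritten. A minimal sketch, writing to a hypothetical final_data.jsonl:)

import json

# Append one JSON object per line instead of rewriting a single JSON array
with open('mini_data.txt', 'r') as f_in, open('final_data.jsonl', 'w') as f_out:
    for line in f_in:
        fields = line.strip().split('|')
        record = {"fild1": fields[0], "fild2": fields[1],
                  "fild3": fields[2].replace(" ", ""), "fild4": fields[3]}
        f_out.write(json.dumps(record) + '\n')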

I suggest going straight from your delimited text to DynamoDB. It might look something like this:

dynamodb = boto3.resource(
    'dynamodb', endpoint_url="http://localhost:8000")
table = dynamodb.Table('fruits')

with table.batch_writer() as batch:
    with open('mini_data.txt', 'r') as f_in:
        for line in f_in:
            fields = line.strip().split('|')
            batch.put_item(
                Item={
                    'fild1': fields[0],
                    'fild2': fields[1],
                    'fild3': fields[2].replace(" ", ""),
                    'fild4': fields[3]
                }
            )
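Note that the batch_writer context manager already buffers items and sends them in groups of 25 (the BatchWriteItem limit), re-queuing any unprocessed items, so you don't need to manage batching yourself. If a single process is still too slow for 200M rows and your table's write capacity allows it, you could also parallelize the load. A rough sketch, assuming the input has first been split into chunk files (for example with `split -l 1000000 mini_data.txt chunk_`) and that the 'fruits' table already exists:

import glob
import multiprocessing

import boto3

def load_chunk(path):
    # Each worker process creates its own boto3 resource and batch writer
    dynamodb = boto3.resource('dynamodb', endpoint_url="http://localhost:8000")
    table = dynamodb.Table('fruits')
    with table.batch_writer() as batch:
        with open(path, 'r') as f_in:
            for line in f_in:
                fields = line.strip().split('|')
                batch.put_item(
                    Item={
                        'fild1': fields[0],
                        'fild2': fields[1],
                        'fild3': fields[2].replace(" ", ""),
                        'fild4': fields[3]
                    }
                )

if __name__ == '__main__':
    # One worker per chunk file, up to 4 at a time
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(load_chunk, glob.glob('chunk_*'))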
