
What’s the best way to upload data to DynamoDB with boto3: directly, or via S3 and Data Pipeline?

I have a large JSON file and I would like to know whether it is better to upload this data directly to DynamoDB using boto3, or to upload it to S3 first and then load it into DynamoDB with Data Pipeline.

Here are a few sample records:

Sample1:

{  
   "updated":{  
      "n":"20181226"
   },
   "periodo":{  
      "n":"20180823"
   },
   "tipos":{  
      "m":{  
         "Disponible":{  
            "m":{  
               "total":{  
                  "n":"200"
               },
               "Saldos de Cuentas de Ahorro":{  
                  "n":"300"
               }
            }
         }
      }
   },
   "mediana_disponible":{  
      "n":"588"
   },
   "mediana_ingreso":{  
      "n":"658"
   },
   "mediana_egreso":{  
      "n":"200"
   },
   "documento":{  
      "s":"2-2"
   }
}

This sample is a single record; there are about 68 million records in total and the file size is 70 GB.

Sample2:

{  
   "updated":{  
      "n":"20190121"
   },
   "zonas":{  
      "s":"123"
   },
   "tipo_doc":{  
      "n":"3123"
   },
   "cods_sai":{  
      "s":"3,234234,234234"
   },
   "cods_cb":{  
      "s":"234234,5435,45"
   },
   "cods_atm":{  
      "s":"54,45,345;345,5345,435"
   },
   "num_doc":{  
      "n":"345"
   },
   "cods_mf":{  
      "s":"NNN"
   },
   "cods_pac":{  
      "s":"NNN"
   }
}

This sample is a single record; there are about 7 million records in total and the file size is 10 GB.

Thanks in advance

For your situation I would use AWS Data Pipeline to import your JSON data files into DynamoDB from S3. There are many examples provided by AWS and by others on the Internet.

Your use case, for me, is just on the border between simply writing a Python import script and deploying Data Pipeline. Since your data is clean, deploying a pipeline will be very easy.
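For the simple-script route, a minimal sketch with boto3's `batch_writer` could look like the following. The table name, file path, and newline-delimited layout are assumptions, not from the question, and it assumes the type descriptors use DynamoDB's standard uppercase keys ("N", "S", "M"); the lowercase keys in the samples above would need to be normalized first.

```python
import json
import boto3
from boto3.dynamodb.types import TypeDeserializer

# Assumptions (not from the question): table name, file path,
# and one JSON record per line in the input file.
TABLE_NAME = "my-table"
FILE_PATH = "records.json"

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)
deserializer = TypeDeserializer()

def to_plain_item(ddb_json):
    """Convert DynamoDB-typed JSON ({"N": "1"}, {"S": "x"}, ...) into plain Python values."""
    return {k: deserializer.deserialize(v) for k, v in ddb_json.items()}

with table.batch_writer() as batch, open(FILE_PATH) as f:
    for line in f:
        record = json.loads(line)
        # batch_writer buffers items into 25-item BatchWriteItem requests
        # and automatically resends unprocessed items.
        batch.put_item(Item=to_plain_item(record))
```

At this volume the limiting factor is usually the table's provisioned write capacity rather than the script itself, so you would still need to provision (or use on-demand) capacity accordingly.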

I would definitely copy your data to S3 first and then process it from there. The primary reason is the unreliability of the public Internet when transferring this much data.
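For the copy-to-S3 step, a short sketch (bucket and key names are placeholders, not from the question): boto3's managed `upload_file` performs multipart transfers with retries, which matters for a 70 GB file.

```python
import boto3

s3 = boto3.client("s3")

# upload_file uses managed multipart uploads under the hood,
# splitting the large file into parts and retrying failed parts.
s3.upload_file(
    Filename="records.json",
    Bucket="my-import-bucket",          # placeholder bucket name
    Key="dynamodb-import/records.json", # placeholder object key
)
```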

If this task will be repeated over time, then I would definitely use AWS Data Pipeline.
