
What’s the best way to upload data to DynamoDB with boto3: directly, or via S3 and Data Pipeline?

I have a large JSON file and I would like to know whether it is better to upload this data directly to DynamoDB using boto3, or to upload it to S3 first and then load it into DynamoDB with Data Pipeline.

Here are a few sample records:

Sample1:

{  
   "updated":{  
      "n":"20181226"
   },
   "periodo":{  
      "n":"20180823"
   },
   "tipos":{  
      "m":{  
         "Disponible":{  
            "m":{  
               "total":{  
                  "n":"200"
               },
               "Saldos de Cuentas de Ahorro":{  
                  "n":"300"
               }
            }
         }
      }
   },
   "mediana_disponible":{  
      "n":"588"
   },
   "mediana_ingreso":{  
      "n":"658"
   },
   "mediana_egreso":{  
      "n":"200"
   },
   "documento":{  
      "s":"2-2"
   }
}

This sample is a single record; there are about 68 million records in total and the file size is 70 GB.

Sample2:

{  
   "updated":{  
      "n":"20190121"
   },
   "zonas":{  
      "s":"123"
   },
   "tipo_doc":{  
      "n":"3123"
   },
   "cods_sai":{  
      "s":"3,234234,234234"
   },
   "cods_cb":{  
      "s":"234234,5435,45"
   },
   "cods_atm":{  
      "s":"54,45,345;345,5345,435"
   },
   "num_doc":{  
      "n":"345"
   },
   "cods_mf":{  
      "s":"NNN"
   },
   "cods_pac":{  
      "s":"NNN"
   }
}

This sample is a single record; there are about 7 million records in total and the file size is 10 GB.

Thanks in advance

For your situation I would use AWS Data Pipeline to import your JSON data files into DynamoDB from S3. There are many examples provided by AWS and by others on the Internet.

Your use case, for me, is just on the border between simply writing a Python import script and deploying Data Pipeline. Since your data is clean, deploying a pipeline will be very easy.
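For the simple-script route, a minimal sketch with boto3's `batch_writer` could look like the following. The table name, file path, and newline-delimited layout are assumptions, not from the question, and it assumes the type descriptors use DynamoDB's standard uppercase keys ("N", "S", "M"); the lowercase keys in the samples above would need to be normalized first.

```python
import json
import boto3
from boto3.dynamodb.types import TypeDeserializer

# Assumptions (not from the question): table name, file path,
# and one JSON record per line in the input file.
TABLE_NAME = "my-table"
FILE_PATH = "records.json"

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)
deserializer = TypeDeserializer()

def to_plain_item(ddb_json):
    """Convert DynamoDB-typed JSON ({"N": "1"}, {"S": "x"}, ...) into plain Python values."""
    return {k: deserializer.deserialize(v) for k, v in ddb_json.items()}

with table.batch_writer() as batch, open(FILE_PATH) as f:
    for line in f:
        record = json.loads(line)
        # batch_writer buffers items into 25-item BatchWriteItem requests
        # and automatically resends unprocessed items.
        batch.put_item(Item=to_plain_item(record))
```

At this volume the limiting factor is usually the table's provisioned write capacity rather than the script itself, so you would still need to provision (or use on-demand) capacity accordingly.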

I would definitely copy your data to S3 first and then process it from there. The primary reason is the unreliability of the public Internet when transferring this much data.
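For the copy-to-S3 step, a short sketch (bucket and key names are placeholders, not from the question): boto3's managed `upload_file` performs multipart transfers with retries, which matters for a 70 GB file.

```python
import boto3

s3 = boto3.client("s3")

# upload_file uses managed multipart uploads under the hood,
# splitting the large file into parts and retrying failed parts.
s3.upload_file(
    Filename="records.json",
    Bucket="my-import-bucket",          # placeholder bucket name
    Key="dynamodb-import/records.json", # placeholder object key
)
```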

If this task will be repeated over time, then I would definitely use AWS Data Pipeline.
