
Writing data to Amazon Timestream from AWS Glue

I'm trying to use Glue streaming to write data to Amazon Timestream, but I'm having a hard time configuring the JDBC connection.

The steps I'm following are below, along with the documentation link: https://docs.aws.amazon.com/timestream/latest/developerguide/JDBC.configuring.html

  1. I'm uploading the jar to S3. There are multiple jars in the release and I tried each one of them: https://github.com/awslabs/amazon-timestream-driver-jdbc/releases
  2. In the Glue job I'm pointing the jar lib path to the above S3 location.
  3. In the job script I'm trying to read from and write to Timestream using both Spark and Glue with the code below, but it's not working. Can someone explain what I'm doing wrong here?

This is my code:

url = "jdbc:timestream://AccessKeyId=<myAccessKeyId>;SecretAccessKey=<mySecretAccessKey>;SessionToken=<mySessionToken>;Region=us-east-1"

source_df = sparkSession.read.format("jdbc") \
    .option("url", url) \
    .option("dbtable", "IoT") \
    .option("driver", "software.amazon.timestream.jdbc.TimestreamDriver") \
    .load()

datasink1 = glueContext.write_dynamic_frame.from_options(
    frame=applymapping0,
    connection_type="jdbc",
    connection_options={
        "url": url,
        "driver": "software.amazon.timestream.jdbc.TimestreamDriver",
        "database": "CovidTestDb",
        "dbtable": "CovidTestTable",
    },
    transformation_ctx="datasink1",
)

As of this date (April 2022) there is no support for write operations in Timestream's JDBC driver (I reviewed the code and saw a number of "no write support" exceptions). It is possible to read data from Timestream using Glue, though. The following steps worked for me:

  • Upload the timestream-query and timestream-jdbc jars to an S3 bucket that you can reference in your Glue script
  • Ensure that the IAM role for the script has read access to the Timestream database and table
  • You don't need the access key and secret parameters in the JDBC URL; something like jdbc:timestream://Region=<timestream-db-region> should be enough
  • Specify the driver and fetchsize options: option("driver", "software.amazon.timestream.jdbc.TimestreamDriver") and option("fetchsize", "100") (tweak the fetchsize according to your needs)
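Since the question uses a Python Glue job, the options from the bullets above can be collected into a small helper for PySpark. This is just a sketch: the helper name `timestream_jdbc_options` is mine, and the `db.tbl` names in the default query are placeholders.

```python
def timestream_jdbc_options(region="us-east-1", fetchsize=100,
                            query="select * from db.tbl where time between ago(15m) and now()"):
    """Build the options for spark.read.format('jdbc') against the
    Timestream JDBC driver: no credentials in the URL, driver class,
    and a fetchsize tuned to your needs."""
    return {
        "url": f"jdbc:timestream://Region={region}",
        "driver": "software.amazon.timestream.jdbc.TimestreamDriver",
        "query": query,
        "fetchsize": str(fetchsize),
    }

# In the Glue job (the SparkSession `spark` is provided by the Glue boilerplate):
# df = spark.read.format("jdbc").options(**timestream_jdbc_options()).load()
# df.show()
```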

Following is a complete example of reading a dataframe from timestream:

val df = sparkSession.read.format("jdbc")
      .option("url", "jdbc:timestream://Region=us-east-1")
      .option("driver","software.amazon.timestream.jdbc.TimestreamDriver")
      // optionally add a query to narrow the data to fetch
      .option("query", "select * from db.tbl where time between ago(15m) and now()")
      .option("fetchsize", "100")
      .load()
df.write.format("console").save()
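Since the JDBC driver rejects writes, one workaround (not part of the answer above, just a sketch) is to write from the Glue Python job with boto3's timestream-write client instead. The database and table names below are the ones from the question; the record layout (a `device_id` dimension with a numeric measure) is an assumed example.

```python
import time

def build_records(rows):
    """Convert (device_id, measure_name, value) tuples into the Records
    structure expected by timestream-write's write_records call."""
    now_ms = str(int(time.time() * 1000))  # milliseconds since epoch
    return [
        {
            "Dimensions": [{"Name": "device_id", "Value": device_id}],
            "MeasureName": measure_name,
            "MeasureValue": str(value),
            "MeasureValueType": "DOUBLE",
            "Time": now_ms,
        }
        for device_id, measure_name, value in rows
    ]

def write_rows(rows, database="CovidTestDb", table="CovidTestTable"):
    """Write rows to Timestream via the timestream-write API.
    Requires timestream:WriteRecords permission on the Glue job's role."""
    import boto3  # imported here so build_records stays usable without AWS deps
    client = boto3.client("timestream-write")
    records = build_records(rows)
    # write_records accepts at most 100 records per call
    for i in range(0, len(records), 100):
        client.write_records(
            DatabaseName=database,
            TableName=table,
            Records=records[i:i + 100],
        )
```

To push a Glue DynamicFrame this way, convert it with `toDF().collect()` (fine for small batches; for large volumes a `foreachPartition` write would be more appropriate).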

Hope this helps.
