
How to send data from Lambda (Python) to Redshift through Kinesis

I have a Lambda function in Python that does some work and returns some values that need to be inserted into Redshift. In the Lambda, I push the values to Kinesis, which copies them to S3 and from there into Redshift.

The values in the Lambda end up in a string as follows:

final_string = 'a;b;c;d'

Each letter is the value for a different column in the Redshift table, hence the delimiter ';'. Then I push the data to the Kinesis stream with:

import json
import boto3

kinesis_client = boto3.client('kinesis')

put_response = kinesis_client.put_record(StreamName='PixelTrack',
                                         Data=json.dumps(final_string),
                                         PartitionKey='first')

The Kinesis stream then feeds a Kinesis Data Firehose delivery stream. The file that Firehose generates in S3 looks like this (the quotes are included in the file):

"a;b;c;d;c" 

And finally I copy the data to Redshift with the following statement (configured in Kinesis Firehose):

COPY table
FROM blabla
BLANKSASNULL
DELIMITER ';'
EMPTYASNULL
NULL AS 'null'
ESCAPE
FILLRECORD;

I have managed to make it work and get the values into Redshift when just one record is buffered in Kinesis (although I had to create an extra column in Redshift). So, when only one Lambda has run during the buffer interval, the Redshift table looks like:

  A        B         C         D     no_info_column
  "a       b         c         d"        <null>

The problem comes when I execute the Lambda several times, because then I get a file in S3 with the following text in it:

"a,b,c,d" "a1,b1,c1,d1"

And in Redshift I get the error Extra column(s) found, because the COPY statement is unable to find where one row ends and the next begins: COPY expects each input row on its own line, but Firehose simply concatenates the buffered records.

I have tried the following, without success:

  • Returning the string from the Lambda
  • Searching for how to set a row delimiter in COPY (SO question)
  • Converting a list to JSON instead of a string; then I had problems with the brackets that open the list
  • Using REMOVEQUOTES in the COPY statement

My original question was going to be "How to copy from S3 to Redshift with different rows separated by double quotes", but maybe the problem lies in my overall approach, so I have decided to make the question a little broader.

So, how could I solve this?

If you wish to send streaming data to Amazon Redshift, you can use Amazon Kinesis Data Firehose. It has built-in buffering of data based on size (MB) or time (seconds) for batching writes to Amazon Redshift.

You are correct that it is not ideal to perform small INSERT operations on Redshift; it is much better to load data in bulk. So, if you need to load data on a continuous basis, Kinesis Data Firehose gives a good balance between load efficiency and latency.

You mention that "the Kinesis stream feeds a Kinesis Data Firehose delivery stream". Feel free to write directly from the AWS Lambda function to Kinesis Data Firehose; the intermediate Kinesis stream is not needed.
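For example, a minimal sketch (the delivery stream name 'PixelTrack-delivery' is made up, and final_string is the ';'-delimited string from the question). The trailing newline is the important part: Firehose concatenates records as-is, so each record must end with '\n' for the Redshift COPY to split rows. Skipping json.dumps also keeps the stray double quotes out of the file:

import boto3

firehose_client = boto3.client('firehose')

# Send the raw delimited string plus a newline so that every record
# lands on its own line in the S3 object. No json.dumps, so no
# surrounding double quotes end up in the file.
put_response = firehose_client.put_record(
    DeliveryStreamName='PixelTrack-delivery',  # hypothetical name
    Record={'Data': (final_string + '\n').encode('utf-8')}
)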

There is currently a feature in development (available in preview) to ingest directly from a Kinesis stream into Redshift: https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion.html

You have to refresh the view on whatever schedule suits your needs. It might be worth a look.
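A rough sketch of that setup, following the linked documentation; the schema, view, and IAM role names below are made up, and the exact syntax may change while the feature is in preview:

CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';

-- Materialize the stream; kinesis_data holds the raw record bytes,
-- decoded here back into the original 'a;b;c;d' string.
CREATE MATERIALIZED VIEW pixel_track_mv AS
SELECT approximate_arrival_timestamp,
       from_varbyte(kinesis_data, 'utf-8') AS raw_record
FROM kinesis_schema."PixelTrack";

-- Refresh on whatever schedule suits your latency needs.
REFRESH MATERIALIZED VIEW pixel_track_mv;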
