Kinesis Firehose putting JSON objects in S3 without separator comma
Before sending the data I am applying JSON.stringify to it, and it looks like this:
{"data": [{"key1": value1, "key2": value2}, {"key1": value1, "key2": value2}]}
But once it passes through AWS API Gateway and Kinesis Firehose puts it to S3, it looks like this:
{
"key1": value1,
"key2": value2
}{
"key1": value1,
"key2": value2
}
The separator comma between the JSON objects is gone, but I need it to process the data properly.
Template in the API Gateway:
#set($root = $input.path('$'))
{
    "DeliveryStreamName": "some-delivery-stream",
    "Records": [
#foreach($r in $root.data)
        #set($data = "{
            ""key1"": ""$r.value1"",
            ""key2"": ""$r.value2""
        }")
        {
            "Data": "$util.base64Encode($data)"
        }#if($foreach.hasNext),#end
#end
    ]
}
I had this same problem recently, and the only answers I was able to find were basically just to add line breaks ("\n") to the end of every JSON message whenever you posted them to the Kinesis stream, or to use a raw JSON decoder method of some sort that can process concatenated JSON objects without delimiters.
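For the first option, here's a minimal sketch of what appending the newline looks like when putting records onto the stream directly with boto3 (the stream name is taken from the question; credentials and region are assumed to be configured):

import json

import boto3

firehose = boto3.client('firehose')
record = {"key1": "value1", "key2": "value2"}

firehose.put_record(
    DeliveryStreamName='some-delivery-stream',
    # The trailing '\n' is what keeps the objects separated in S3.
    Record={'Data': (json.dumps(record) + '\n').encode('utf-8')},
)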
For the second approach, I posted a Python code solution which can be found over here on a related Stack Overflow post: https://stackoverflow.com/a/49417680/1546785
Once AWS Firehose dumps the JSON objects to S3, it's perfectly possible to read the individual JSON objects from the files.
Using Python, you can use the raw_decode function from the json package:
from json import JSONDecoder, JSONDecodeError
import re
import json
import boto3

NOT_WHITESPACE = re.compile(r'[^\s]')

def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    while True:
        # Skip any whitespace between concatenated JSON objects.
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        pos = match.start()
        try:
            # raw_decode parses one JSON object and reports where it ended,
            # so we can pick up the next object from that position.
            obj, pos = decoder.raw_decode(document, pos)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj

s3 = boto3.resource('s3')
s3_object = s3.Object("my-bucket", "my-firehose-json-key.json")
# raw_decode operates on strings, so decode the S3 object's bytes first.
file_content = s3_object.get()['Body'].read().decode('utf-8')

for obj in decode_stacked(file_content):
    print(json.dumps(obj))
    # {"key1": value1, "key2": value2}
    # {"key1": value1, "key2": value2}
source: https://stackoverflow.com/a/50384432/1771155
Using Glue / PySpark you can use:
import json

# Assumes 'sc' is an initialized SparkContext in the Glue/PySpark environment.
# textFile splits the S3 object on newlines, yielding one JSON document per line.
rdd = sc.textFile("s3a://my-bucket/my-firehose-file-containing-json-objects")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
source: https://stackoverflow.com/a/62984450/1771155
One approach you could consider is to configure data processing for your Kinesis Firehose delivery stream by adding a Lambda function as its data processor, which would be executed before finally delivering the data to the S3 bucket.
DeliveryStream:
  ...
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamType: DirectPut
    ExtendedS3DestinationConfiguration:
      ...
      BucketARN: !GetAtt MyDeliveryBucket.Arn
      ProcessingConfiguration:
        Enabled: true
        Processors:
          - Parameters:
              - ParameterName: LambdaArn
                ParameterValue: !GetAtt MyTransformDataLambdaFunction.Arn
            Type: Lambda
      ...
And in the Lambda function, ensure that '\n' is appended to the record's JSON string; see the Lambda function myTransformData.ts in Node.js below:
import {
  FirehoseTransformationEvent,
  FirehoseTransformationEventRecord,
  FirehoseTransformationHandler,
  FirehoseTransformationResult,
  FirehoseTransformationResultRecord,
} from 'aws-lambda';

const createDroppedRecord = (
  recordId: string
): FirehoseTransformationResultRecord => {
  return {
    recordId,
    result: 'Dropped',
    data: Buffer.from('').toString('base64'),
  };
};

const processData = (
  payloadStr: string,
  record: FirehoseTransformationEventRecord
) => {
  let jsonRecord;
  // ...
  // Process the original payload,
  // and create the record in JSON
  return jsonRecord;
};

const transformRecord = (
  record: FirehoseTransformationEventRecord
): FirehoseTransformationResultRecord => {
  try {
    const payloadStr = Buffer.from(record.data, 'base64').toString();
    const jsonRecord = processData(payloadStr, record);
    if (!jsonRecord) {
      console.error('Error creating json record');
      return createDroppedRecord(record.recordId);
    }
    return {
      recordId: record.recordId,
      result: 'Ok',
      // Ensure that '\n' is appended to the record's JSON string.
      data: Buffer.from(JSON.stringify(jsonRecord) + '\n').toString('base64'),
    };
  } catch (error) {
    console.error(`Error processing record ${record.recordId}: `, error);
    return createDroppedRecord(record.recordId);
  }
};

const transformRecords = (
  event: FirehoseTransformationEvent
): FirehoseTransformationResult => {
  let records: FirehoseTransformationResultRecord[] = [];
  for (const record of event.records) {
    const transformed = transformRecord(record);
    records.push(transformed);
  }
  return { records };
};

export const handler: FirehoseTransformationHandler = async (
  event,
  _context
) => {
  const transformed = transformRecords(event);
  return transformed;
};
Once the newline delimiter is in place, AWS services such as Athena will be able to work properly with the JSON record data in the S3 bucket, rather than seeing only the first JSON record.
Please use this code to solve your issue:
__author__ = "Soumil Nitin Shah"

import json
import boto3
import base64


class MyHasher(object):
    """Base64-encodes the given key (Firehose expects base64 record data)."""

    def __init__(self, key):
        self.key = key

    def get(self):
        keys = str(self.key).encode("UTF-8")
        keys = base64.b64encode(keys)
        keys = keys.decode("UTF-8")
        return keys


def lambda_handler(event, context):
    output = []

    for record in event['records']:
        payload = base64.b64decode(record['data'])

        # Re-serialize the payload with a trailing newline so that each
        # record lands on its own line in the S3 object.
        serialize_payload = json.dumps(json.loads(payload)) + "\n"

        hasherHelper = MyHasher(key=serialize_payload)
        hash = hasherHelper.get()

        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': hash
        }
        print("output_record", output_record)

        output.append(output_record)

    return {'records': output}
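As a quick local sanity check, you can call the handler with a hypothetical one-record event (mimicking the shape Firehose passes to a processor Lambda) and verify the newline ends up in the transformed record:

import base64

# Hypothetical sample event; 'data' must be base64-encoded, as Firehose sends it.
sample_event = {
    'records': [
        {
            'recordId': '1',
            'data': base64.b64encode(b'{"key1": "value1", "key2": "value2"}').decode(),
        }
    ]
}

result = lambda_handler(sample_event, None)
decoded = base64.b64decode(result['records'][0]['data']).decode()
assert decoded.endswith('\n')  # each record is newline-terminated
print(repr(decoded))  # '{"key1": "value1", "key2": "value2"}\n'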