
Kinesis Firehose putting JSON objects in S3 without separator comma

Before sending the data I am using JSON.stringify on it, and it looks like this:

{"data": [{"key1": value1, "key2": value2}, {"key1": value1, "key2": value2}]}

But once it passes through AWS API Gateway and Kinesis Firehose puts it to S3, it looks like this:

    {
     "key1": value1, 
     "key2": value2
    }{
     "key1": value1, 
     "key2": value2
    }

The separator commas between the JSON objects are gone, but I need them in order to process the data properly.

Template in the API Gateway:

#set($root = $input.path('$'))
{
    "DeliveryStreamName": "some-delivery-stream",
    "Records": [
#foreach($r in $root.data)
#set($data = "{
    ""key1"": ""$r.value1"",
    ""key2"": ""$r.value2""
}")
    {
        "Data": "$util.base64Encode($data)"
    }#if($foreach.hasNext),#end
#end
    ]
}

I had this same problem recently, and the only answers I was able to find were basically just to add line breaks ("\n") to the end of every JSON message whenever you posted them to the Kinesis stream, or to use a raw JSON decoder method of some sort that can process concatenated JSON objects without delimiters.
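A minimal sketch of the first option, assuming a boto3 producer (the record contents below are made up; the stream name is taken from the question's template):

import json
import boto3

firehose = boto3.client('firehose')

records = [{'key1': 'value1', 'key2': 'value2'},
           {'key1': 'value3', 'key2': 'value4'}]

for item in records:
    # The trailing '\n' makes Firehose write newline-delimited JSON to S3,
    # so downstream consumers can split records on line breaks.
    firehose.put_record(
        DeliveryStreamName='some-delivery-stream',
        Record={'Data': (json.dumps(item) + '\n').encode('utf-8')}
    )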

I posted a Python code solution which can be found over here on a related Stack Overflow post: https://stackoverflow.com/a/49417680/1546785

Once AWS Firehose dumps the JSON objects to S3, it's perfectly possible to read the individual JSON objects from the files.

Using Python, you can use the raw_decode method of JSONDecoder from the json package:

from json import JSONDecoder, JSONDecodeError
import re
import json
import boto3

NOT_WHITESPACE = re.compile(r'[^\s]')

def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    while True:
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        pos = match.start()

        try:
            obj, pos = decoder.raw_decode(document, pos)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj

s3 = boto3.resource('s3')

obj = s3.Object("my-bucket", "my-firehose-json-key.json")
# The S3 object body is bytes; decode it before scanning for JSON objects.
file_content = obj.get()['Body'].read().decode('utf-8')
for json_obj in decode_stacked(file_content):
    print(json.dumps(json_obj))
    #  {"key1": value1, "key2": value2}
    #  {"key1": value1, "key2": value2}

source: https://stackoverflow.com/a/50384432/1771155

Using Glue / PySpark you can use:

import json

# sc.textFile splits the input on newlines, so this assumes one complete
# JSON object per line (i.e. newline-delimited records in S3).
rdd = sc.textFile("s3a://my-bucket/my-firehose-file-containing-json-objects")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()

source: https://stackoverflow.com/a/62984450/1771155

One approach you could consider is to configure data processing for your Kinesis Firehose delivery stream by adding a Lambda function as its data processor, which would be executed before finally delivering the data to the S3 bucket.

DeliveryStream:
  ...
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamType: DirectPut
    ExtendedS3DestinationConfiguration:
      ...
      BucketARN: !GetAtt MyDeliveryBucket.Arn
      ProcessingConfiguration:
        Enabled: true
        Processors:
          - Parameters:
              - ParameterName: LambdaArn
                ParameterValue: !GetAtt MyTransformDataLambdaFunction.Arn
            Type: Lambda
    ...

And in the Lambda function, ensure that '\n' is appended to the record's JSON string; see below the Lambda function myTransformData.ts in Node.js:

import {
  FirehoseTransformationEvent,
  FirehoseTransformationEventRecord,
  FirehoseTransformationHandler,
  FirehoseTransformationResult,
  FirehoseTransformationResultRecord,
} from 'aws-lambda';

const createDroppedRecord = (
  recordId: string
): FirehoseTransformationResultRecord => {
  return {
    recordId,
    result: 'Dropped',
    data: Buffer.from('').toString('base64'),
  };
};

const processData = (
  payloadStr: string,
  record: FirehoseTransformationEventRecord
) => {
  let jsonRecord;
  // ...
  // Process the original payload
  // and create the record as JSON.
  return jsonRecord;
};

const transformRecord = (
  record: FirehoseTransformationEventRecord
): FirehoseTransformationResultRecord => {
  try {
    const payloadStr = Buffer.from(record.data, 'base64').toString();
    const jsonRecord = processData(payloadStr, record);
    if (!jsonRecord) {
      console.error('Error creating json record');
      return createDroppedRecord(record.recordId);
    }
    return {
      recordId: record.recordId,
      result: 'Ok',
      // Ensure that '\n' is appended to the record's JSON string.
      data: Buffer.from(JSON.stringify(jsonRecord) + '\n').toString('base64'),
    };
  } catch (error) {
    console.error(`Error processing record ${record.recordId}: `, error);
    return createDroppedRecord(record.recordId);
  }
};

const transformRecords = (
  event: FirehoseTransformationEvent
): FirehoseTransformationResult => {
  let records: FirehoseTransformationResultRecord[] = [];
  for (const record of event.records) {
    const transformed = transformRecord(record);
    records.push(transformed);
  }
  return { records };
};

export const handler: FirehoseTransformationHandler = async (
  event,
  _context
) => {
  const transformed = transformRecords(event);
  return transformed;
};

Once the newline delimiter is in place, AWS services such as Athena will be able to work properly with the JSON record data in the S3 bucket, rather than seeing only the first JSON record.
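For illustration only, assuming a table has already been defined over the Firehose S3 prefix (the database, table, and output location below are hypothetical), starting an Athena query over the newline-delimited JSON with boto3 might look like this:

import boto3

athena = boto3.client('athena')

# Athena's JSON SerDe expects one JSON object per line, which is exactly
# what the '\n'-appending transformation above produces.
response = athena.start_query_execution(
    QueryString='SELECT key1, key2 FROM my_firehose_table LIMIT 10',
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-query-results/'}
)
print(response['QueryExecutionId'])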

Please use this code to solve your issue. It decodes each Firehose record, re-serializes the JSON payload with a trailing newline, and base64-encodes it again for the transformation response:


__Author__ = "Soumil Nitin Shah"
import json
import boto3
import base64


# Despite the name, this helper base64-encodes the record data, which is the
# encoding Firehose expects in the transformation response.
class MyHasher(object):
    def __init__(self, key):
        self.key = key

    def get(self):
        keys = str(self.key).encode("UTF-8")
        keys = base64.b64encode(keys)
        keys = keys.decode("UTF-8")
        return keys

def lambda_handler(event, context):

    output = []
    for record in event['records']:

        payload = base64.b64decode(record['data'])

        """Get the payload from event bridge and just get data attr """""
        serialize_payload = str(json.loads(payload)) + "\n"
        hasherHelper = MyHasher(key=serialize_payload)
        hash = hasherHelper.get()

        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': hash
        }
        print("output_record", output_record)

        output.append(output_record)

    return {'records': output}
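As a quick local sanity check (the sample record below is made up), you could invoke the handler above with a minimal Firehose transformation event:

if __name__ == "__main__":
    # Hypothetical test event containing one base64-encoded JSON record.
    sample = {"key1": "value1", "key2": "value2"}
    event = {
        "records": [
            {
                "recordId": "1",
                "data": base64.b64encode(json.dumps(sample).encode("UTF-8")).decode("UTF-8")
            }
        ]
    }
    print(lambda_handler(event, None))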






