Kinesis Firehose putting JSON objects in S3 without separator comma
Before sending the data I am applying JSON.stringify to it, and it looks like this:
{"data": [{"key1": value1, "key2": value2}, {"key1": value1, "key2": value2}]}
But once it passes through AWS API Gateway and Kinesis Firehose puts it to S3, it looks like this:
{
"key1": value1,
"key2": value2
}{
"key1": value1,
"key2": value2
}
The separator comma between the JSON objects is gone, but I need it to process the data properly.
Template in the API Gateway:
#set($root = $input.path('$'))
{
    "DeliveryStreamName": "some-delivery-stream",
    "Records": [
#foreach($r in $root.data)
        #set($data = "{
            ""key1"": ""$r.value1"",
            ""key2"": ""$r.value2""
        }")
        {
            "Data": "$util.base64Encode($data)"
        }#if($foreach.hasNext),#end
#end
    ]
}
I had this same problem recently, and the only answers I was able to find were basically just to add line breaks ("\n") to the end of every JSON message whenever you posted them to the Kinesis stream, or to use a raw JSON decoder method of some sort that can process concatenated JSON objects without delimiters.
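For the first option, here's a minimal sketch of what appending the newline looks like when putting records onto the stream directly with boto3 (the stream name is taken from the question; credentials and region are assumed to be configured):

import json

import boto3

firehose = boto3.client('firehose')
record = {"key1": "value1", "key2": "value2"}

firehose.put_record(
    DeliveryStreamName='some-delivery-stream',
    # The trailing '\n' is what keeps the objects separated in S3.
    Record={'Data': (json.dumps(record) + '\n').encode('utf-8')},
)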
For the second approach, I posted a Python code solution which can be found over here on a related Stack Overflow post: https://stackoverflow.com/a/49417680/1546785
Once AWS Firehose dumps the JSON objects to S3, it's perfectly possible to read the individual JSON objects from the files.
Using Python, you can use the raw_decode function from the json package:
from json import JSONDecoder, JSONDecodeError
import re
import json
import boto3

NOT_WHITESPACE = re.compile(r'[^\s]')

def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    while True:
        # Skip any whitespace between concatenated JSON objects.
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        pos = match.start()
        try:
            # raw_decode parses one JSON object and reports where it ended,
            # so we can pick up the next object from that position.
            obj, pos = decoder.raw_decode(document, pos)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj

s3 = boto3.resource('s3')
s3_object = s3.Object("my-bucket", "my-firehose-json-key.json")
# raw_decode operates on strings, so decode the S3 object's bytes first.
file_content = s3_object.get()['Body'].read().decode('utf-8')

for obj in decode_stacked(file_content):
    print(json.dumps(obj))
    # {"key1": value1, "key2": value2}
    # {"key1": value1, "key2": value2}
source: https://stackoverflow.com/a/50384432/1771155
Using Glue / PySpark you can use:
import json

# Assumes 'sc' is an initialized SparkContext in the Glue/PySpark environment.
# textFile splits the S3 object on newlines, yielding one JSON document per line.
rdd = sc.textFile("s3a://my-bucket/my-firehose-file-containing-json-objects")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
source: https://stackoverflow.com/a/62984450/1771155
One approach you could consider is to configure data processing for your Kinesis Firehose delivery stream by adding a Lambda function as its data processor, which would be executed before finally delivering the data to the S3 bucket.
DeliveryStream:
  ...
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamType: DirectPut
    ExtendedS3DestinationConfiguration:
      ...
      BucketARN: !GetAtt MyDeliveryBucket.Arn
      ProcessingConfiguration:
        Enabled: true
        Processors:
          - Parameters:
              - ParameterName: LambdaArn
                ParameterValue: !GetAtt MyTransformDataLambdaFunction.Arn
            Type: Lambda
      ...
And in the Lambda function, ensure that '\n' is appended to the record's JSON string; see the Lambda function myTransformData.ts in Node.js below:
import {
  FirehoseTransformationEvent,
  FirehoseTransformationEventRecord,
  FirehoseTransformationHandler,
  FirehoseTransformationResult,
  FirehoseTransformationResultRecord,
} from 'aws-lambda';

const createDroppedRecord = (
  recordId: string
): FirehoseTransformationResultRecord => {
  return {
    recordId,
    result: 'Dropped',
    data: Buffer.from('').toString('base64'),
  };
};

const processData = (
  payloadStr: string,
  record: FirehoseTransformationEventRecord
) => {
  let jsonRecord;
  // ...
  // Process the original payload,
  // and create the record in JSON
  return jsonRecord;
};

const transformRecord = (
  record: FirehoseTransformationEventRecord
): FirehoseTransformationResultRecord => {
  try {
    const payloadStr = Buffer.from(record.data, 'base64').toString();
    const jsonRecord = processData(payloadStr, record);
    if (!jsonRecord) {
      console.error('Error creating json record');
      return createDroppedRecord(record.recordId);
    }
    return {
      recordId: record.recordId,
      result: 'Ok',
      // Ensure that '\n' is appended to the record's JSON string.
      data: Buffer.from(JSON.stringify(jsonRecord) + '\n').toString('base64'),
    };
  } catch (error) {
    console.error(`Error processing record ${record.recordId}: `, error);
    return createDroppedRecord(record.recordId);
  }
};

const transformRecords = (
  event: FirehoseTransformationEvent
): FirehoseTransformationResult => {
  let records: FirehoseTransformationResultRecord[] = [];
  for (const record of event.records) {
    const transformed = transformRecord(record);
    records.push(transformed);
  }
  return { records };
};

export const handler: FirehoseTransformationHandler = async (
  event,
  _context
) => {
  const transformed = transformRecords(event);
  return transformed;
};
Once the newline delimiter is in place, AWS services such as Athena will be able to work properly with the JSON record data in the S3 bucket, rather than seeing only the first JSON record.
Please use this code to solve your issue:
__author__ = "Soumil Nitin Shah"

import json
import boto3
import base64


class MyHasher(object):
    """Base64-encodes the given key (Firehose expects base64 record data)."""

    def __init__(self, key):
        self.key = key

    def get(self):
        keys = str(self.key).encode("UTF-8")
        keys = base64.b64encode(keys)
        keys = keys.decode("UTF-8")
        return keys


def lambda_handler(event, context):
    output = []

    for record in event['records']:
        payload = base64.b64decode(record['data'])

        # Re-serialize the payload with a trailing newline so that each
        # record lands on its own line in the S3 object.
        serialize_payload = json.dumps(json.loads(payload)) + "\n"

        hasherHelper = MyHasher(key=serialize_payload)
        hash = hasherHelper.get()

        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': hash
        }
        print("output_record", output_record)

        output.append(output_record)

    return {'records': output}
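As a quick local sanity check, you can call the handler with a hypothetical one-record event (mimicking the shape Firehose passes to a processor Lambda) and verify the newline ends up in the transformed record:

import base64

# Hypothetical sample event; 'data' must be base64-encoded, as Firehose sends it.
sample_event = {
    'records': [
        {
            'recordId': '1',
            'data': base64.b64encode(b'{"key1": "value1", "key2": "value2"}').decode(),
        }
    ]
}

result = lambda_handler(sample_event, None)
decoded = base64.b64decode(result['records'][0]['data']).decode()
assert decoded.endswith('\n')  # each record is newline-terminated
print(repr(decoded))  # '{"key1": "value1", "key2": "value2"}\n'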