
Cloudwatch Logs -> Kinesis Firehose -> S3 - not proper JSON?

I'm trying to build a centralized logging solution using CloudWatch Logs subscription filters to write logs to Kinesis Firehose -> S3 -> AWS Glue -> Athena. I'm running into a lot of issues with data formatting.

Initially, I was using AWS::KinesisFirehose's S3DestinationConfiguration to write to S3 and then trying to either crawl the data with AWS::Glue::Crawler or create the table manually in the CloudFormation template. I found the Crawler had a lot of trouble determining the data format on S3 (it classified it as ION instead of JSON, and ION can't be queried by Athena). I'm now trying ExtendedS3DestinationConfiguration, which allows explicit configuration of input and output formats so I can force conversion to Parquet.

Unfortunately, with this setup Kinesis Firehose returns error logs saying the input is not valid JSON. This makes me wonder whether the CloudWatch Logs subscription filter is not writing proper JSON - but there are no configuration options on that object to control the data format.

This is not a particularly unusual problem statement, so somebody out there must have a working configuration. Here are some snippets of my failing configuration:

ExtendedS3DestinationConfiguration:
        BucketARN: !Sub arn:aws:s3:::${S3Bucket}
        Prefix: !Sub ${S3LogsPath}year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
        ErrorOutputPrefix: !Sub ${FailedWritePath}
        BufferingHints:
          IntervalInSeconds: 300
          SizeInMBs: 128
        CloudWatchLoggingOptions:
          Enabled: true
          LogGroupName: !Sub ${AppId}-logstream-${Environment}
          LogStreamName: logs
        CompressionFormat: UNCOMPRESSED
        RoleARN: !GetAtt FirehoseRole.Arn
        DataFormatConversionConfiguration:
          Enabled: true
          InputFormatConfiguration:
            Deserializer:
              OpenXJsonSerDe: {}
          OutputFormatConfiguration:
            Serializer:
              ParquetSerDe: {}
          SchemaConfiguration:
            CatalogId: !Ref AWS::AccountId
            DatabaseName: !Ref CentralizedLoggingDatabase
            Region: !Ref AWS::Region
            RoleARN: !GetAtt FirehoseRole.Arn
            TableName: !Ref LogsGlueTable
            VersionId: LATEST

Former config:

S3DestinationConfiguration:
        BucketARN: !Sub arn:aws:s3:::${S3Bucket}
        Prefix: !Sub ${S3LogsPath}year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
        ErrorOutputPrefix: !Sub ${FailedWritePath}
        BufferingHints:
          IntervalInSeconds: 300
          SizeInMBs: 128
        CloudWatchLoggingOptions:
          Enabled: true
          LogGroupName: !Sub ${AppId}-logstream-${Environment}
          LogStreamName: logs
        CompressionFormat: GZIP
        RoleARN: !GetAtt FirehoseRole.Arn

And the crawler:

Type: AWS::Glue::Crawler
    Properties:
      Name: !Sub ${DNSEndPoint}_logging_s3_crawler_${Environment}
      DatabaseName: !Ref CentralizedLoggingDatabase
      Description: AWS Glue crawler to crawl logs on S3
      Role: !GetAtt CentralizedLoggingGlueRole.Arn
#      Schedule: ## run on demand
#        ScheduleExpression: cron(40 * * * ? *)
      Targets:
        S3Targets:
          - Path: !Sub s3://${S3Bucket}/${S3LogsPath}
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: LOG
      TablePrefix: !Sub ${AppId}_${Environment}_

The error, using ExtendedS3DestinationConfiguration:

"attemptsMade":1,"arrivalTimestamp":1582650068665,"lastErrorCode":"DataFormatConversion.ParseError","lastErrorMessage":"Encountered malformed JSON. Illegal character ((CTRL-CHAR, code 31)): only regular white space (\\r, \\n, \\t) is allowed between tokens\\n at [Source: com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream@2ce955fc; line: 1, column: 2] "attemptsMade":1,"arrivalTimestamp":1582650068665,"lastErrorCode":"DataFormatConversion.ParseError","lastErrorMessage":"遇到格式错误的 JSON。非法字符((CTRL-CHAR,代码 31)):只有常规空格(\\ [来源:com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream@2ce955fc;行:1,列:2]

Seems like there is some configuration issue here, but I cannot find it.

So I've just been through this in a similar scenario, but now have it working.

Firehose writes the logs to S3 compressed and Base64-encoded, as an array of JSON records rather than one record per line; the CTRL-CHAR code 31 in the error above is 0x1F, the first byte of a gzip header. For Athena to read the data, it needs to be decompressed, with one JSON record per line.
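
As a minimal sketch (the log group, stream and filter names below are hypothetical), this is roughly what one record looks like in a Firehose transformation Lambda's event once decoded - a single gzip-compressed JSON envelope with a logEvents array, which is why it is not readable as line-delimited JSON:

import base64
import gzip
import json

def decode_subscription_record(data: str) -> dict:
    """Base64-decode and gunzip one CloudWatch Logs batch as delivered to Firehose."""
    return json.loads(gzip.decompress(base64.b64decode(data)))

# Decompressed, the envelope looks roughly like this (values are hypothetical):
# {
#   "messageType": "DATA_MESSAGE",
#   "owner": "123456789012",
#   "logGroup": "/my-app/application-logs",
#   "logStream": "my-stream",
#   "subscriptionFilters": ["my-subscription-filter"],
#   "logEvents": [
#     {"id": "...", "timestamp": 1582650068665, "message": "{\"level\":\"INFO\",...}"}
#   ]
# }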

So create a Lambda function from the blueprint kinesis-firehose-cloudwatch-logs-processor, enable transformation in your Firehose delivery stream, and specify that Lambda function. It will decompress the data and write the JSON to S3 with one record per line; a sketch of what the processor does is shown below.
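
A simplified sketch of what that processor does (the actual blueprint also handles Firehose batching limits and re-ingestion of oversized records, which is omitted here):

import base64
import gzip
import json

def handler(event, context):
    """Firehose data-transformation handler: gunzip CloudWatch Logs batches
    and emit one JSON log event per line so Glue/Athena can read them."""
    output = []
    for record in event['records']:
        payload = json.loads(gzip.decompress(base64.b64decode(record['data'])))

        if payload['messageType'] == 'CONTROL_MESSAGE':
            # CloudWatch Logs sends a control message when the subscription is created.
            output.append({'recordId': record['recordId'], 'result': 'Dropped'})
            continue

        # Join each log event's message with a trailing newline -> newline-delimited JSON.
        data = ''.join(e['message'] + '\n' for e in payload['logEvents'])
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(data.encode('utf-8')).decode('utf-8'),
        })
    return {'records': output}

In the CloudFormation template this is wired up by adding a ProcessingConfiguration block (with a Lambda processor) to the ExtendedS3DestinationConfiguration or S3DestinationConfiguration.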

Creating the Athena table:

CREATE EXTERNAL TABLE mydb.mytable(
  eventversion string COMMENT 'from deserializer', 
  useridentity struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer', 
  eventtime string COMMENT 'from deserializer', 
  eventsource string COMMENT 'from deserializer', 
  eventname string COMMENT 'from deserializer', 
  awsregion string COMMENT 'from deserializer', 
  sourceipaddress string COMMENT 'from deserializer', 
  useragent string COMMENT 'from deserializer', 
  errorcode string COMMENT 'from deserializer', 
  errormessage string COMMENT 'from deserializer', 
  requestparameters string COMMENT 'from deserializer', 
  responseelements string COMMENT 'from deserializer', 
  additionaleventdata string COMMENT 'from deserializer', 
  requestid string COMMENT 'from deserializer', 
  eventid string COMMENT 'from deserializer', 
  resources array<struct<arn:string,accountid:string,type:string>> COMMENT 'from deserializer', 
  eventtype string COMMENT 'from deserializer', 
  apiversion string COMMENT 'from deserializer', 
  readonly string COMMENT 'from deserializer', 
  recipientaccountid string COMMENT 'from deserializer', 
  serviceeventdetails string COMMENT 'from deserializer', 
  sharedeventid string COMMENT 'from deserializer', 
  vpcendpointid string COMMENT 'from deserializer', 
  managementevent boolean COMMENT 'from deserializer', 
  eventcategory string COMMENT 'from deserializer')
PARTITIONED BY ( 
  datehour string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='awsRegion,eventCategory,eventID,eventName,eventSource,eventTime,eventType,eventVersion,managementEvent,readOnly,recipientAccountId,requestID,requestParameters,responseElements,sourceIPAddress,userAgent,userIdentity') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mybucket/prefix'
TBLPROPERTIES (
  'projection.datehour.format'='yyyy/MM/dd/HH', 
  'projection.datehour.interval'='1', 
  'projection.datehour.interval.unit'='HOURS', 
  'projection.datehour.range'='2021/01/01/00,NOW', 
  'projection.datehour.type'='date', 
  'projection.enabled'='true', 
  'storage.location.template'='s3://mybucket/myprefix/${datehour}'
)
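
With partition projection configured as above, Athena derives the datehour partitions from the storage.location.template at query time, so no crawler or MSCK REPAIR TABLE run is needed. As a hypothetical smoke test (the database, table and results-bucket names are placeholders), a query could be kicked off with boto3:

import boto3

athena = boto3.client('athena')

# Placeholder names; substitute your own database, table and results bucket.
response = athena.start_query_execution(
    QueryString="SELECT eventname, count(*) AS n "
                "FROM mydb.mytable "
                "WHERE datehour >= '2021/01/01/00' "
                "GROUP BY eventname ORDER BY n DESC LIMIT 10",
    QueryExecutionContext={'Database': 'mydb'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/'},
)
print(response['QueryExecutionId'])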
