
Cloudwatch Logs -> Kinesis Firehose -> S3 - not proper JSON?

I'm trying to build a centralized logging solution using Cloudwatch subscription filters to write logs to Kinesis Firehose -> S3 -> AWS Glue -> Athena. I'm running into a lot of problems with the data format.

Initially I was writing to S3 using the S3DestinationConfiguration of AWS::KinesisFirehose, then trying to crawl the data with an AWS::Glue::Crawler or to create the table manually in the Cloudformation template. I found the Crawler had a lot of trouble determining the data format on S3 (it detected ION rather than JSON - and Athena cannot query ION). I'm now trying ExtendedS3DestinationConfiguration, which allows explicit configuration of the input and output formats to force it to use Parquet.

Unfortunately, with this setup Kinesis Firehose returns error logs saying the input is not valid JSON. This makes me suspect that the Cloudwatch subscription filter is not writing proper JSON - but there are no configuration options on that object to control the data format.

This is not a particularly unusual problem statement, so someone must have the right configuration. Here are some snippets of my failing configuration:

ExtendedS3DestinationConfiguration:
        BucketARN: !Sub arn:aws:s3:::${S3Bucket}
        Prefix: !Sub ${S3LogsPath}year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
        ErrorOutputPrefix: !Sub ${FailedWritePath}
        BufferingHints:
          IntervalInSeconds: 300
          SizeInMBs: 128
        CloudWatchLoggingOptions:
          Enabled: true
          LogGroupName: !Sub ${AppId}-logstream-${Environment}
          LogStreamName: logs
        CompressionFormat: UNCOMPRESSED
        RoleARN: !GetAtt FirehoseRole.Arn
        DataFormatConversionConfiguration:
          Enabled: true
          InputFormatConfiguration:
            Deserializer:
              OpenXJsonSerDe: {}
          OutputFormatConfiguration:
            Serializer:
              ParquetSerDe: {}
          SchemaConfiguration:
            CatalogId: !Ref AWS::AccountId
            DatabaseName: !Ref CentralizedLoggingDatabase
            Region: !Ref AWS::Region
            RoleARN: !GetAtt FirehoseRole.Arn
            TableName: !Ref LogsGlueTable
            VersionId: LATEST

Previous configuration:

S3DestinationConfiguration:
        BucketARN: !Sub arn:aws:s3:::${S3Bucket}
        Prefix: !Sub ${S3LogsPath}year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
        ErrorOutputPrefix: !Sub ${FailedWritePath}
        BufferingHints:
          IntervalInSeconds: 300
          SizeInMBs: 128
        CloudWatchLoggingOptions:
          Enabled: true
          LogGroupName: !Sub ${AppId}-logstream-${Environment}
          LogStreamName: logs
        CompressionFormat: GZIP
        RoleARN: !GetAtt FirehoseRole.Arn

And the crawler:

Type: AWS::Glue::Crawler
    Properties:
      Name: !Sub ${DNSEndPoint}_logging_s3_crawler_${Environment}
      DatabaseName: !Ref CentralizedLoggingDatabase
      Description: AWS Glue crawler to crawl logs on S3
      Role: !GetAtt CentralizedLoggingGlueRole.Arn
#      Schedule: ## run on demand
#        ScheduleExpression: cron(40 * * * ? *)
      Targets:
        S3Targets:
          - Path: !Sub s3://${S3Bucket}/${S3LogsPath}
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: LOG
      TablePrefix: !Sub ${AppId}_${Environment}_

The error, using ExtendedS3DestinationConfiguration:

"attemptsMade":1,"arrivalTimestamp":1582650068665,"lastErrorCode":"DataFormatConversion.ParseError","lastErrorMessage":"Encountered malformed JSON. Illegal character ((CTRL-CHAR, code 31)): only regular white space (\\r, \\n, \\t) is allowed between tokens [Source: com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream@2ce955fc; line: 1, column: 2]

It seems there is some configuration problem here, but I can't find it.

So I just went through this in a similar situation, but have it working now.

Firehose writes the logs to S3 gzip-compressed, as an array of JSON records. Athena expects the data to be decompressed, with 1 JSON record per line.
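This also explains the error message above: CTRL-CHAR code 31 is 0x1F, the first byte of the gzip magic number (0x1F 0x8B), so the JSON deserializer is almost certainly being handed raw gzip bytes. A minimal sketch of checking for this, in Python with a made-up payload:

```python
import gzip
import json

# A CloudWatch-Logs-style payload, gzip-compressed the way the
# subscription filter delivers it (the payload content is hypothetical).
payload = gzip.compress(json.dumps({"logEvents": [{"message": "hello"}]}).encode())

# 0x1F is CTRL-CHAR code 31 - the first byte of the gzip magic number,
# which is exactly the character the ParseError complains about.
print(payload[:2].hex())  # -> "1f8b"

# A JSON parser fails on those bytes; decompressing first succeeds.
print(json.loads(gzip.decompress(payload))["logEvents"][0]["message"])  # -> "hello"
```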

So create a lambda function from the blueprint: kinesis-firehose-cloudwatch-logs-processor. Enable Transformations in your Firehose, and specify the lambda function above. This will decompress the data and put the JSON into S3 with 1 record per line.
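For orientation, here is a simplified sketch of what that blueprint does (the real blueprint also handles re-ingestion of oversized batches and the 6 MB response limit, which this sketch omits): base64-decode each record, gunzip it, and emit the logEvents as newline-delimited JSON.

```python
import base64
import gzip
import json

def handler(event, context):
    """Simplified Firehose transform: CloudWatch Logs payload -> NDJSON."""
    output = []
    for record in event["records"]:
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))
        if payload.get("messageType") != "DATA_MESSAGE":
            # Control messages (e.g. CloudWatch's initial test event) are dropped.
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        # One JSON record per line, which is what Athena's JsonSerDe expects.
        ndjson = "".join(json.dumps(e) + "\n" for e in payload["logEvents"])
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(ndjson.encode()).decode(),
        })
    return {"records": output}
```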

Create the Athena table:

CREATE EXTERNAL TABLE mydb.mytable(
  eventversion string COMMENT 'from deserializer', 
  useridentity struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer', 
  eventtime string COMMENT 'from deserializer', 
  eventsource string COMMENT 'from deserializer', 
  eventname string COMMENT 'from deserializer', 
  awsregion string COMMENT 'from deserializer', 
  sourceipaddress string COMMENT 'from deserializer', 
  useragent string COMMENT 'from deserializer', 
  errorcode string COMMENT 'from deserializer', 
  errormessage string COMMENT 'from deserializer', 
  requestparameters string COMMENT 'from deserializer', 
  responseelements string COMMENT 'from deserializer', 
  additionaleventdata string COMMENT 'from deserializer', 
  requestid string COMMENT 'from deserializer', 
  eventid string COMMENT 'from deserializer', 
  resources array<struct<arn:string,accountid:string,type:string>> COMMENT 'from deserializer', 
  eventtype string COMMENT 'from deserializer', 
  apiversion string COMMENT 'from deserializer', 
  readonly string COMMENT 'from deserializer', 
  recipientaccountid string COMMENT 'from deserializer', 
  serviceeventdetails string COMMENT 'from deserializer', 
  sharedeventid string COMMENT 'from deserializer', 
  vpcendpointid string COMMENT 'from deserializer', 
  managementevent boolean COMMENT 'from deserializer', 
  eventcategory string COMMENT 'from deserializer')
PARTITIONED BY ( 
  datehour string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='awsRegion,eventCategory,eventID,eventName,eventSource,eventTime,eventType,eventVersion,managementEvent,readOnly,recipientAccountId,requestID,requestParameters,responseElements,sourceIPAddress,userAgent,userIdentity') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mybucket/prefix'
TBLPROPERTIES (
  'projection.datehour.format'='yyyy/MM/dd/HH', 
  'projection.datehour.interval'='1', 
  'projection.datehour.interval.unit'='HOURS', 
  'projection.datehour.range'='2021/01/01/00,NOW', 
  'projection.datehour.type'='date', 
  'projection.enabled'='true', 
  'storage.location.template'='s3://mybucket/myprefix/${datehour}'
)
