Cloudwatch Logs -> Kinesis Firehose -> S3 - not proper JSON?
I'm trying to build a centralized logging solution using CloudWatch subscription filters that write logs to Kinesis Firehose -> S3 -> AWS Glue -> Athena. I'm running into a lot of trouble with the data format.
Initially, I used the S3DestinationConfiguration of AWS::KinesisFirehose to write to S3, then tried to crawl the data with an AWS::Glue::Crawler, or to create the table manually in the CloudFormation template. I found that the crawler had a lot of trouble determining the data format on S3 (it detected ION rather than JSON, and Athena cannot query ION). I'm now trying ExtendedS3DestinationConfiguration, which allows explicitly configuring the input and output formats to force it to use Parquet.
Unfortunately, with this setup Kinesis Firehose returns error logs saying the input is not valid JSON. That makes me suspect the CloudWatch subscription filter isn't writing proper JSON, but there are no configuration options on that object to control the data format.
This isn't a particularly unusual problem statement, so someone must have gotten the configuration right. Here are some snippets of my failing configuration:
ExtendedS3DestinationConfiguration:
  BucketARN: !Sub arn:aws:s3:::${S3Bucket}
  Prefix: !Sub ${S3LogsPath}year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
  ErrorOutputPrefix: !Sub ${FailedWritePath}
  BufferingHints:
    IntervalInSeconds: 300
    SizeInMBs: 128
  CloudWatchLoggingOptions:
    Enabled: true
    LogGroupName: !Sub ${AppId}-logstream-${Environment}
    LogStreamName: logs
  CompressionFormat: UNCOMPRESSED
  RoleARN: !GetAtt FirehoseRole.Arn
  DataFormatConversionConfiguration:
    Enabled: true
    InputFormatConfiguration:
      Deserializer:
        OpenXJsonSerDe: {}
    OutputFormatConfiguration:
      Serializer:
        ParquetSerDe: {}
    SchemaConfiguration:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CentralizedLoggingDatabase
      Region: !Ref AWS::Region
      RoleARN: !GetAtt FirehoseRole.Arn
      TableName: !Ref LogsGlueTable
      VersionId: LATEST
The previous configuration:
S3DestinationConfiguration:
  BucketARN: !Sub arn:aws:s3:::${S3Bucket}
  Prefix: !Sub ${S3LogsPath}year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
  ErrorOutputPrefix: !Sub ${FailedWritePath}
  BufferingHints:
    IntervalInSeconds: 300
    SizeInMBs: 128
  CloudWatchLoggingOptions:
    Enabled: true
    LogGroupName: !Sub ${AppId}-logstream-${Environment}
    LogStreamName: logs
  CompressionFormat: GZIP
  RoleARN: !GetAtt FirehoseRole.Arn
And the crawler:
Type: AWS::Glue::Crawler
Properties:
  Name: !Sub ${DNSEndPoint}_logging_s3_crawler_${Environment}
  DatabaseName: !Ref CentralizedLoggingDatabase
  Description: AWS Glue crawler to crawl logs on S3
  Role: !GetAtt CentralizedLoggingGlueRole.Arn
  # Schedule: ## run on demand
  #   ScheduleExpression: cron(40 * * * ? *)
  Targets:
    S3Targets:
      - Path: !Sub s3://${S3Bucket}/${S3LogsPath}
  SchemaChangePolicy:
    UpdateBehavior: UPDATE_IN_DATABASE
    DeleteBehavior: LOG
  TablePrefix: !Sub ${AppId}_${Environment}_
The error, using ExtendedS3DestinationConfiguration:
"attemptsMade":1,"arrivalTimestamp":1582650068665,"lastErrorCode":"DataFormatConversion.ParseError","lastErrorMessage":"Encountered malformed JSON. Illegal character ((CTRL-CHAR, code 31)): only regular white space (\\r, \\n, \\t) is allowed between tokens [Source: com.fasterxml.jackson.databind.util.ByteBufferBackedInputStream@2ce955fc; line: 1, column: 2]"
It seems like there's a configuration problem somewhere here, but I can't find it.
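As an aside, the "CTRL-CHAR, code 31" in the error is a strong hint on its own: 31 decimal is 0x1F, the first byte of the gzip magic number, which suggests the format converter is being fed gzip-compressed bytes rather than plain JSON. A minimal sketch of the same observation, using only the standard library (no AWS calls; the payload shape is just an illustrative stand-in for what a CloudWatch Logs subscription delivers):

```python
import gzip
import json

# Hypothetical stand-in for a CloudWatch Logs subscription payload.
payload = {"messageType": "DATA_MESSAGE", "logEvents": [{"message": "hello"}]}
raw = gzip.compress(json.dumps(payload).encode("utf-8"))

# Any gzip stream starts with the magic bytes 0x1F 0x8B, so a JSON parser
# pointed at it fails immediately on "CTRL-CHAR, code 31".
print(raw[0])          # 31
print(raw[:2].hex())   # 1f8b
```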
So I just went through this in a similar scenario, and have it working now.
Firehose writes the logs to S3 gzip-compressed, and as an array of JSON records. For Athena to read the data, it needs to be decompressed, with one JSON record per line.
So create a lambda function from the blueprint kinesis-firehose-cloudwatch-logs-processor. Enable Transformations in your Firehose, and specify the lambda function above. This will decompress the data and put the JSON into S3 with one record per line.
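The blueprint itself is a Node.js function, but its core transform can be sketched roughly as follows (a hypothetical Python port, not the blueprint's actual code; the `records`/`recordId`/`result`/`data` fields follow the Firehose transformation contract, and the real blueprint additionally handles CONTROL_MESSAGE records and re-ingestion of oversized batches):

```python
import base64
import gzip
import json

def handler(event, context):
    """Sketch of the kinesis-firehose-cloudwatch-logs-processor logic:
    base64-decode and gunzip each record, then emit one JSON log event
    per line so Athena's JsonSerDe can read the output."""
    output = []
    for record in event["records"]:
        compressed = base64.b64decode(record["data"])
        payload = json.loads(gzip.decompress(compressed))
        if payload.get("messageType") != "DATA_MESSAGE":
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        # Newline-delimited JSON: one log event per line.
        lines = "".join(json.dumps(e) + "\n" for e in payload["logEvents"])
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(lines.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```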
Create the Athena table:
CREATE EXTERNAL TABLE mydb.mytable(
  eventversion string COMMENT 'from deserializer',
  useridentity struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer',
  eventtime string COMMENT 'from deserializer',
  eventsource string COMMENT 'from deserializer',
  eventname string COMMENT 'from deserializer',
  awsregion string COMMENT 'from deserializer',
  sourceipaddress string COMMENT 'from deserializer',
  useragent string COMMENT 'from deserializer',
  errorcode string COMMENT 'from deserializer',
  errormessage string COMMENT 'from deserializer',
  requestparameters string COMMENT 'from deserializer',
  responseelements string COMMENT 'from deserializer',
  additionaleventdata string COMMENT 'from deserializer',
  requestid string COMMENT 'from deserializer',
  eventid string COMMENT 'from deserializer',
  resources array<struct<arn:string,accountid:string,type:string>> COMMENT 'from deserializer',
  eventtype string COMMENT 'from deserializer',
  apiversion string COMMENT 'from deserializer',
  readonly string COMMENT 'from deserializer',
  recipientaccountid string COMMENT 'from deserializer',
  serviceeventdetails string COMMENT 'from deserializer',
  sharedeventid string COMMENT 'from deserializer',
  vpcendpointid string COMMENT 'from deserializer',
  managementevent boolean COMMENT 'from deserializer',
  eventcategory string COMMENT 'from deserializer')
PARTITIONED BY (
  datehour string)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'paths'='awsRegion,eventCategory,eventID,eventName,eventSource,eventTime,eventType,eventVersion,managementEvent,readOnly,recipientAccountId,requestID,requestParameters,responseElements,sourceIPAddress,userAgent,userIdentity')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mybucket/prefix'
TBLPROPERTIES (
  'projection.datehour.format'='yyyy/MM/dd/HH',
  'projection.datehour.interval'='1',
  'projection.datehour.interval.unit'='HOURS',
  'projection.datehour.range'='2021/01/01/00,NOW',
  'projection.datehour.type'='date',
  'projection.enabled'='true',
  'storage.location.template'='s3://mybucket/myprefix/${datehour}'
)
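With partition projection enabled as above, Athena derives the datehour partitions from the projection properties, so there is no need to run MSCK REPAIR TABLE or add partitions by hand. An illustrative query (the table and date range are placeholders matching the DDL above):

```sql
SELECT eventsource, eventname, count(*) AS calls
FROM mydb.mytable
WHERE datehour BETWEEN '2021/01/01/00' AND '2021/01/02/00'
GROUP BY eventsource, eventname
ORDER BY calls DESC;
```

Filtering on datehour lets Athena prune to only the matching hourly S3 prefixes instead of scanning the whole bucket.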