简体   繁体   中英

AWS Glue crawler reading GZIP header info

I have set up a crawler in Glue, which crawls compressed CSV files (GZIP format) from S3 bucket. I have an ETL job which converts this CSV into Parquet and another crawler which read parquet file and populates parquet table.

The first crawler which reads compressed CSV file (GZIP format) seems like reading GZIP file header information. CSV file has five fields. Below is the sample data

3456789,1,200,20190118,9040

However after crawler populates the table, row looks like below.

xyz.csv0000644000175200017530113404730513420142427014701 0ustar wlsadmin3456789 1 200 20190118 9040

First column has some additional data which has file name and the user name of the machine where the GZIP file created.

Any idea, how can I avoid this situation and read correct values.

In my opinion you will not able to create the table by using AWS Glue Crawler in your case.

https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html

Of course you can try to implement Glue Grok Custom Classifier, but it can be dificult to do.

https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-grok

I suggest to create the table manually, by using Create command in Athena or in Glue console. Because you need to skip first row you have to set 'skip.header.line.count'='1' option in TBLPROPERTIES. The definition of this table may look as the following example:

 CREATE EXTERNAL TABLE `gzip_test`(
  `col1` bigint, 
  `col2` bigint, 
  `col3` bigint, 
  `col4` bigint, 
  `col5` bigint)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://example/example'
TBLPROPERTIES (
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='gzip', 
  'delimiter'=',', 
  'skip.header.line.count'='1', 
  'typeOfData'='file')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM