
AWS Glue crawler reading GZIP header info

I have set up a crawler in Glue which crawls compressed CSV files (GZIP format) from an S3 bucket. I have an ETL job which converts this CSV into Parquet, and another crawler which reads the Parquet files and populates a Parquet table.
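
For reference, the CSV-to-Parquet step is a standard Glue job along these lines. This is only a minimal sketch: the database name, table name and output path below are placeholders, not the actual names used in my pipeline.

 # Minimal sketch of the CSV-to-Parquet Glue job; "mydb", "csv_table" and the
 # output path are placeholders, not the real names.
 import sys
 from awsglue.utils import getResolvedOptions
 from awsglue.context import GlueContext
 from awsglue.job import Job
 from pyspark.context import SparkContext

 args = getResolvedOptions(sys.argv, ['JOB_NAME'])
 glue_context = GlueContext(SparkContext.getOrCreate())
 job = Job(glue_context)
 job.init(args['JOB_NAME'], args)

 # Read the table that the first crawler created from the gzipped CSV files
 source = glue_context.create_dynamic_frame.from_catalog(
     database="mydb",
     table_name="csv_table")

 # Write it back to S3 as Parquet for the second crawler to pick up
 glue_context.write_dynamic_frame.from_options(
     frame=source,
     connection_type="s3",
     connection_options={"path": "s3://example/parquet/"},
     format="parquet")

 job.commit()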

The first crawler, which reads the compressed CSV file (GZIP format), seems to be reading the GZIP file header information. The CSV file has five fields. Below is the sample data:

3456789,1,200,20190118,9040

However, after the crawler populates the table, the row looks like the below.

xyz.csv0000644000175200017530113404730513420142427014701 0ustar wlsadmin3456789 1 200 20190118 9040

The first column has some additional data, which contains the file name and the user name of the machine where the GZIP file was created.

Any idea how I can avoid this situation and read the correct values?

In my opinion, you will not be able to create the table by using an AWS Glue crawler in your case. The built-in classifiers are described here:

https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html

Of course, you can try to implement a Glue Grok custom classifier, but it can be difficult to do.

https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-grok
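
If you do want to go down that route, a custom classifier can be registered with boto3 roughly as in the sketch below. The classifier name and the Grok pattern are illustrative placeholders only; the hard part is writing a pattern that actually copes with the extra header material in your files, and the crawler then has to be configured to use the classifier.

 # Sketch of registering a Grok custom classifier with boto3; the classifier
 # name and the Grok pattern are illustrative placeholders, not a tested
 # pattern for these files.
 import boto3

 glue = boto3.client("glue")

 glue.create_classifier(
     GrokClassifier={
         "Name": "gzip-csv-classifier",  # placeholder name
         "Classification": "csv",
         # Match five comma-separated integer fields per row (adjust as needed)
         "GrokPattern": "%{INT:col1},%{INT:col2},%{INT:col3},%{INT:col4},%{INT:col5}",
     }
 )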

I suggest creating the table manually, by using a CREATE command in Athena or in the Glue console. Because you need to skip the first row, you have to set the 'skip.header.line.count'='1' option in TBLPROPERTIES. The definition of this table may look like the following example:

 CREATE EXTERNAL TABLE `gzip_test`(
  `col1` bigint, 
  `col2` bigint, 
  `col3` bigint, 
  `col4` bigint, 
  `col5` bigint)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://example/example'
TBLPROPERTIES (
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='gzip', 
  'delimiter'=',', 
  'skip.header.line.count'='1', 
  'typeOfData'='file')
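
Once the table exists, you can sanity-check that the first line is being skipped with a quick query, either directly in the Athena console or via boto3 as sketched below. The database name and the query result location are placeholders, not values from your setup.

 # Quick sanity check via the Athena API; the database name and the query
 # result location are placeholders.
 import boto3

 athena = boto3.client("athena")

 response = athena.start_query_execution(
     QueryString="SELECT * FROM gzip_test LIMIT 10",
     QueryExecutionContext={"Database": "default"},  # assumed database name
     ResultConfiguration={"OutputLocation": "s3://example/athena-results/"},
 )
 print(response["QueryExecutionId"])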

