
AWS Glue Crawler Cannot Extract CSV Headers

At my wits' end here...

I have 15 CSV files that I am generating from a beeline query like:

beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv

I chose dsv because some string fields include commas and are not quoted, which breaks Glue even further. Besides, according to the docs, the built-in CSV classifier can handle pipes (and for the most part, it does).

Anyway, I upload these 15 CSV files to an S3 bucket and run my crawler.

Everything works great. For 14 of them.

Glue is able to extract the header line for every single file except one, for which it names the columns col_0, col_1, etc., and includes the header line in my SELECT query results.

Can anyone provide any insight into what could possibly be different about this one file that is causing this?

If it helps, I have a feeling that some of the fields in this CSV file may, at some point, have been encoded in UTF-16 or something. When I originally opened it, there were some weird "?" characters floating around.

I've run tr -d '\000' on it in an effort to clean it up, but that may not have been enough.
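For the suspected UTF-16 artifacts, here is a minimal clean-up sketch in Python (the helper name and file handling are my own; it assumes the file is either UTF-16 with a byte-order mark or UTF-8 with stray NUL bytes):

```python
def normalize_csv_bytes(raw: bytes) -> bytes:
    """Decode suspected UTF-16 content and re-encode as plain UTF-8."""
    # A UTF-16 file starts with a byte-order mark: FF FE (LE) or FE FF (BE).
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
        text = raw.decode("utf-16")  # honours the BOM and strips it
    else:
        # Otherwise just drop stray NUL bytes (what `tr -d '\000'` did)
        # and decode, replacing anything still undecodable.
        text = raw.replace(b"\x00", b"").decode("utf-8", errors="replace")
    # Plain utf-8, not utf-8-sig: a BOM at the start would confuse the
    # crawler's header detection all over again.
    return text.encode("utf-8")
```

Run this over the file before uploading; `file data.csv` on the result should report plain UTF-8 (or ASCII) text rather than UTF-16.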

Again, any leads, suggestions, or experiments I can run would be great. By the way, I would prefer that the crawler do everything (i.e., not needing to manually change the schema and turn off schema updates).

Thanks for reading.

Edit:

I have a feeling this has something to do with it (source):

  • Every column in a potential header parses as a STRING data type.
  • Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
  • Every column in a potential header must meet the AWS Glue regex requirements for a column name.
  • The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.

Adding a Custom Classifier fixed a similar issue of mine.

You can avoid header detection (which fails when all columns are of string type) by setting ContainsHeader to PRESENT when creating the custom classifier, and then providing the column names through Header. Once the custom classifier has been created, you can assign it to the crawler. Since this is attached to the crawler, you won't need to change the schema after the fact, and you don't risk those changes being overwritten on the next crawler run. Using boto3, it would look something like:

import boto3

glue = boto3.client('glue')

# Custom classifier: declare the header explicitly so Glue skips detection.
glue.create_classifier(CsvClassifier={
    'Name': 'contacts_csv',
    'Delimiter': ',',
    'QuoteSymbol': '"',
    'ContainsHeader': 'PRESENT',
    'Header': ['contact_id', 'person_id', 'type', 'value']
})

# Attach the classifier to the crawler. GLUE_CRAWLER, role, GLUE_DATABASE,
# and s3_path are placeholders defined elsewhere in your code.
glue.create_crawler(Name=GLUE_CRAWLER,
                    Role=role.arn,
                    DatabaseName=GLUE_DATABASE,
                    Targets={'S3Targets': [{'Path': s3_path}]},
                    Classifiers=['contacts_csv'])

I was having the same issue, where Glue does not recognize the header row when all columns are strings.

I found that adding a new column at the end with an integer solves the problem:

id,name,extra_column
sdf13,dog,1
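That workaround can be scripted; here is a minimal sketch using Python's stdlib csv module (the function name and file paths are hypothetical):

```python
import csv

def append_int_column(src_path: str, dst_path: str,
                      col_name: str = "extra_column") -> None:
    """Copy a CSV, appending a constant integer column so at least one
    column parses as other than STRING and Glue accepts the header row."""
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header + [col_name])  # extend the header row
        for row in reader:
            writer.writerow(row + ["1"])      # any integer literal works
```

The extra column is harmless at query time; you can simply omit it from your SELECT lists.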

If the CSV is generated with pandas and the problem is that all columns are strings, you can add index_label='row_number' to the to_csv call so that to_csv creates the extra column for you (without index_label, pandas still prints the index but without a header, which also confuses the crawler).

Glue's header identification is fragile. Make sure the column names are valid SQL identifiers (i.e., no spaces) and that there are no empty column names (which often happens when exporting from Excel).
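To guard against bad names mechanically, here is a small sketch that cleans a header row before export (my own helper; the real Glue naming rules differ in detail, so lowercasing and underscores are a deliberately conservative choice):

```python
import re

def sanitize_header(columns: list) -> list:
    """Make header names safe: no spaces, no empties, no duplicates."""
    seen = {}
    cleaned = []
    for i, col in enumerate(columns):
        # Collapse runs of non-word characters (spaces, punctuation) to "_".
        name = re.sub(r"\W+", "_", col.strip()).strip("_").lower()
        if not name:                 # empty header cell, common from Excel
            name = f"col_{i}"
        if name in seen:             # de-duplicate repeated names
            seen[name] += 1
            name = f"{name}_{seen[name]}"
        else:
            seen[name] = 0
        cleaned.append(name)
    return cleaned
```

For example, `sanitize_header(["User ID", "", "name", "name"])` yields `['user_id', 'col_1', 'name', 'name_1']`.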

This worked for me,

Classifier name : 'your classifier name'

Classifier type : 'csv'

Column delimiter : ',' (change it based on your preference)

Quote symbol : '"' (should be different from the column delimiter)

Column headings :

  • set this to 'No headings'
  • then supply your custom column names, e.g.: userid,username,userphone,useremail

Sample example of a custom classifier

I agree with @Thom Lane and @code_freak: use a Classifier. It is better than appending an extra integer column to a table whose columns are all strings.

You can read more about Classifier from AWS official docs here: https://docs.aws.amazon.com/glue/latest/dg/console-classifiers.html .

First, work out the list of column names (headers) for your data. Then add that list to the Classifier. After that, when you create the Crawler, at the "Add information about your crawler" step, look for Custom Classifiers and add yours to the Crawler.

(Image: creating a custom classifier)

(Image: adding the custom classifier to the crawler)

It is happening because all the column values are strings.

I just added a column named 'id' with the value 1 for every row (any integer is fine), and it worked for me.

Yes, you are correct about the header: if the CSV file contains only string data, the header row is also treated as string data rather than as a header. As a workaround, try setting the property 'skip.header.line.count'='1' in the table properties.
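That property can be set programmatically with boto3's get_table/update_table; here is a sketch, assuming the table already exists in the Glue Data Catalog (database and table names are placeholders). One wrinkle: get_table returns read-only fields that update_table's TableInput rejects, so they have to be stripped first:

```python
# Fields returned by get_table that update_table's TableInput does not accept.
READ_ONLY_FIELDS = (
    "DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
    "IsRegisteredWithLakeFormation", "CatalogId", "VersionId",
)

def with_skip_header(table: dict) -> dict:
    """Build a TableInput dict with skip.header.line.count set to 1,
    leaving the original table description untouched."""
    table_input = {k: v for k, v in table.items() if k not in READ_ONLY_FIELDS}
    params = dict(table_input.get("Parameters", {}))
    params["skip.header.line.count"] = "1"
    table_input["Parameters"] = params
    return table_input

def set_skip_header(database: str, name: str) -> None:
    import boto3  # imported here so the pure helper above needs no AWS deps
    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=name)["Table"]
    glue.update_table(DatabaseName=database, TableInput=with_skip_header(table))
```

Note that a crawler run with schema updates enabled may later overwrite table properties, which is another reason to prefer the classifier-based approaches above.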

Regarding the "?" characters: use a hex editor to inspect those invalid characters and remove them from the file.

Probably the last file you mentioned was saved with the pandas index enabled. While saving, change it to:

df.to_csv('./filename.csv', index=False)
