
AWS Glue Crawler cannot parse large files (classification UNKNOWN)

I've been trying to use an AWS Glue crawler to obtain the columns and other features of a certain JSON file.

I prepared the JSON file locally by converting it to UTF-8, moved it into an S3 bucket with boto3, and pointed the crawler at that bucket.
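For reference, a rough sketch of that step (bucket name, keys, and the source encoding are placeholders, not my actual setup):

import boto3

# Re-encode the file as UTF-8 locally (the original encoding here is just an
# example) and upload it to the S3 prefix the crawler will scan.
with open("data.json", "r", encoding="latin-1") as src:
    text = src.read()
with open("data_utf8.json", "w", encoding="utf-8") as dst:
    dst.write(text)

s3 = boto3.client("s3")
s3.upload_file("data_utf8.json", "my-crawler-bucket", "input/data_utf8.json")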

I created a custom JSON classifier with the JSON path $[*] and created a crawler with otherwise default settings.
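Roughly the equivalent of that setup expressed through the boto3 Glue API (I actually did it through the console; names, the IAM role, and the database are placeholders):

import boto3

glue = boto3.client("glue")

# Custom JSON classifier: $[*] treats each element of the top-level array as a record.
glue.create_classifier(
    JsonClassifier={"Name": "json-array-classifier", "JsonPath": "$[*]"}
)

# Crawler with otherwise default settings, pointed at the S3 prefix above.
glue.create_crawler(
    Name="json-crawler",
    Role="AWSGlueServiceRole-demo",          # placeholder IAM role
    DatabaseName="demo_db",
    Targets={"S3Targets": [{"Path": "s3://my-crawler-bucket/input/"}]},
    Classifiers=["json-array-classifier"],
)

glue.start_crawler(Name="json-crawler")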

When I do this with a relatively small file (<50 KB), the crawler correctly identifies the columns as well as the schema of the nested JSON layers within the main JSON. However, with the file I actually need to process (around 1 GB), the crawler reports "UNKNOWN" as the classification and cannot identify any columns, so I cannot query it.

Any ideas on what the issue is, or some kind of workaround?

I am ultimately trying to convert the data to Parquet and do some querying with Athena.
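For context, the end goal looks something like this sketch (file, bucket, and table names are placeholders; loading a 1 GB JSON into pandas in one go may not be practical, which is part of why I want Glue to handle the schema):

import boto3
import pandas as pd  # writing Parquet also needs pyarrow installed

# Convert the JSON array to Parquet and push it to S3.
df = pd.read_json("data_utf8.json")
df.to_parquet("data.parquet", index=False)

s3 = boto3.client("s3")
s3.upload_file("data.parquet", "my-crawler-bucket", "parquet/data.parquet")

# Query the resulting table with Athena (assumes a crawler or DDL has already
# registered the table in the Glue Data Catalog).
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT * FROM demo_db.data_parquet LIMIT 10",
    QueryExecutionContext={"Database": "demo_db"},
    ResultConfiguration={"OutputLocation": "s3://my-crawler-bucket/athena-results/"},
)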

I've looked at the following post, but that solution did not work; I've already tried rewriting my classifier and crawler, with the same result. I also presume these are not the core problem, because I used $[*] as my custom classifier and practically identical settings with the smaller file, which was crawled correctly.

I'm beginning to think the cause is simply the large file size.

I might be wrong, but there seems to be a limit on the size of file that can be processed. Try splitting your big file into files of about 10 MB (the recommended size). The crawler will process those files in parallel, and when you run it again it will only process changed/new files. Sorry, I couldn't find the related AWS documentation; just try it out and see if it works.

The following is the fix that I ended up using.

I found that the AWS Glue crawler likes JSON objects separated by commas (no outer array brackets).

For example, if you had a large file in the following format:

[
  {},
  {},
  {},...
]

You can manually remove the first and last characters with something like str[1:-1], giving you:

{}
{}
{}...
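A quick sketch of that bracket removal (file names are placeholders; it assumes the whole file fits in memory):

# Strip the outer array brackets so the file is just the JSON objects,
# as shown above.
with open("data_utf8.json", "r", encoding="utf-8") as f:
    text = f.read().strip()

if text.startswith("[") and text.endswith("]"):
    text = text[1:-1]          # the str[1:-1] trick

with open("data_no_brackets.json", "w", encoding="utf-8") as f:
    f.write(text)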

I ended up splitting the file into smaller pieces (10-50 MB each, from the original 1 GB file), and the crawler seemed to be okay with that.
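For the splitting step, something along these lines works, writing each record on its own line as in the example above and starting a new part file once a size cap in the 10-50 MB range is reached (file names and the exact cap are placeholders, and it assumes the full array fits in memory):

import json

MAX_BYTES = 50 * 1024 * 1024   # upper end of the 10-50 MB range

with open("data_utf8.json", "r", encoding="utf-8") as f:
    records = json.load(f)     # the original bracketed array

part, written = 0, 0
out = open(f"part-{part:04d}.json", "w", encoding="utf-8")
for record in records:
    line = json.dumps(record)
    # Roll over to a new part file once the current one passes the cap.
    if written and written + len(line) > MAX_BYTES:
        out.close()
        part += 1
        written = 0
        out = open(f"part-{part:04d}.json", "w", encoding="utf-8")
    out.write(line + "\n")     # one JSON object per line
    written += len(line) + 1
out.close()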
