
Exception with Table identified via AWS Glue Crawler and stored in Data Catalog

I'm working on building my company's new data lake and trying to find the best, most current options for it. I found a combination that looks very promising: EMR + S3 + Athena + Glue.

The process I followed was:

1 - Ran an Apache Spark script to generate 30 million rows, partitioned by date, stored on S3 in ORC format (a sketch of this write follows the list).

2 - Ran an Athena query to create the external table.

3 - Checked the table from EMR, connected to the Glue Data Catalog, and it worked perfectly. Both Spark and Hive were able to access it.

4 - Generated another 30 million rows in another folder, partitioned by date, also in ORC format.

5 - Ran the Glue Crawler, which identified the new table and added it to the Data Catalog. Athena was able to query it, but Spark and Hive were not; their exceptions are shown after the sketch below.
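For reference, a minimal sketch of what the writes in steps 1 and 4 can look like, assuming PySpark; the bucket, path, and column names are placeholders (only audit_id is taken from the Hive error below):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("generate-orc").getOrCreate()

# 30 million rows with a surrogate audit_id and a synthetic date column
# to partition on; a real job would derive these from actual data.
df = (spark.range(30000000)
      .withColumn("audit_id", F.col("id"))
      .withColumn("dt", F.expr("date_add(date '2018-01-01', cast(id % 30 as int))")))

# One folder per date under the table location, files stored as ORC.
# Step 2's Athena DDL then declares an external table STORED AS ORC
# over this location and registers the date partitions.
(df.write
   .mode("overwrite")
   .partitionBy("dt")
   .orc("s3://my-bucket/datalake/events/"))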

Spark:

Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcStruct

Hive:

Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating audit_id (state=,code=0)

I checked whether there was any serialization problem, and I found this:

Table created manually (configuration):

Input format:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
Output format:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.orc.OrcSerde
orc.compress:            SNAPPY

Table created with Glue Crawler:

Input format:            org.apache.hadoop.mapred.TextInputFormat
Output format:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.orc.OrcSerde

So the crawled table does not work when read from Hive or Spark, yet it works in Athena. The mismatch above also explains the exceptions: TextInputFormat hands records to the SerDe as Text, which cannot be cast to the OrcStruct that ORC processing expects. I already changed the configurations, but with no effect on Hive or Spark; one way to make such a change through the Glue API is sketched below.
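For what it's worth, this is roughly what patching the crawled table's format fields by hand looks like through the Glue API (a sketch only; the database and table names are placeholders):

import boto3

glue = boto3.client("glue")

# Fetch the current definition so that only the format fields change.
table = glue.get_table(DatabaseName="mydb", Name="events_crawled")["Table"]
sd = table["StorageDescriptor"]
sd["InputFormat"] = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
sd["OutputFormat"] = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"

# update_table expects a TableInput, i.e. the definition without the
# read-only fields that get_table returns.
table_input = {k: v for k, v in table.items()
               if k in ("Name", "Description", "Owner", "Retention",
                        "StorageDescriptor", "PartitionKeys", "TableType",
                        "Parameters")}
glue.update_table(DatabaseName="mydb", TableInput=table_input)

Note that every partition the crawler registered carries its own StorageDescriptor as well, so a table-level change alone can appear to have no effect in Hive and Spark; each partition would need the same patch.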

Has anyone faced this problem?

Well,

A few weeks after I posted this question, AWS fixed the problem. As I showed above, the problem was real, and it was a bug in Glue.

As it is a new product, it still has some problems from time to time.

But this one was solved properly. See the properties of the table now:

ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
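With the catalog entry corrected, a quick sanity check from Spark on EMR (same placeholder names as in the sketches above):

# This should now read the ORC data instead of failing with the
# Text-to-OrcStruct ClassCastException.
spark.sql("SELECT count(*) FROM mydb.events_crawled").show()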
