What's wrong with this HIVE script to export from DynamoDB to S3 in an AWS Data Pipeline?

Is there something wrong with the HIVE script below, or is this a different problem, perhaps related to the version of HIVE that AWS Data Pipeline installs?

The first part of my AWS Data Pipeline has to export a large table from DynamoDB to S3 for later processing with EMR. The DynamoDB table I'm using for testing is only a few rows long, so I know the data is formatted correctly.

The script associated with the AWS Data Pipeline "Export DynamoDB to S3" building block works correctly for tables that contain only primitive types, but it does not export array types. (Reference: http://archive.cloudera.com/cdh/3/hive/language_manual/data-manipulation-statements.html)

I have stripped out everything specific to Data Pipeline and am now trying to get the following minimal example working, based on the DynamoDB documentation. (Reference: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html)

-- Drop table
DROP TABLE dynamodb_table;

-- http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html
CREATE EXTERNAL TABLE dynamodb_table (song string, artist string, id string, genres array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "InputDB",
               "dynamodb.column.mapping" = "song:song,artist:artist,id:id,genres:genres");

INSERT OVERWRITE DIRECTORY 's3://umami-dev/output/colmap/'
SELECT * FROM dynamodb_table;

Here is the stack trace / EMR error I see when I run the script above:

Diagnostic Messages for this Task:
java.io.IOException: IO error in map input file hdfs://172.31.40.150:9000/mnt/hive_0110/warehouse/dynamodb_table
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:244)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:218)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:238)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.scan(AbstractDynamoDBRecordReader.java:176)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.fetchItems(HiveDynamoDBRecordReader.java:87)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.next(HiveDynamoDBRecordReader.java:44)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.next(HiveDynamoDBRecordReader.java:25)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
... 13 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Command exiting with ret '255'

I have tried a few debugging approaches, none successfully, such as creating the external table with a few different JSON SerDes to handle the formatting. I'm not sure what to try next.

Many thanks.

I answered my own question by creating an EMR cluster and using Hue to iterate quickly on HIVE queries in the Amazon environment.

The solution was to change the format of the items in DynamoDB: what was originally a List of Strings is now a StringSet. My Hive table then works successfully with the array.

Logically, I may lose the ordering of the strings, since I assume a List is ordered and a Set is not. That doesn't matter for my particular problem.
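
To make the difference concrete, here is a minimal sketch of the two item layouts using boto3, the Python AWS SDK. The item values are hypothetical; the low-level client is used because it makes DynamoDB's attribute type tags explicit:

import boto3

# Low-level client, so DynamoDB's attribute type tags are visible.
client = boto3.client("dynamodb")

# Original layout: genres stored as a List of Strings ("L" of "S").
# This is the layout the Hive storage handler failed on for me.
client.put_item(
    TableName="InputDB",
    Item={
        "id": {"S": "1"},
        "artist": {"S": "Some Artist"},  # hypothetical values
        "song": {"S": "Some Song"},
        "genres": {"L": [{"S": "rock"}, {"S": "indie"}]},
    },
)

# Working layout: genres2 stored as a StringSet ("SS"). Note that a set
# is unordered and cannot hold duplicates or be empty.
client.put_item(
    TableName="InputDB",
    Item={
        "id": {"S": "1"},
        "artist": {"S": "Some Artist"},
        "song": {"S": "Some Song"},
        "genres2": {"SS": ["rock", "indie"]},
    },
)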

Here are the relevant parts of the Hive script that finally ran:

-- depends on genres2 being a StringSet (or not existing)
CREATE EXTERNAL TABLE sauce (id string, artist string, song string, genres2 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "InputDB",
               "dynamodb.column.mapping" = "id:id,artist:artist,song:song,genres2:genres2");

-- S3 location to export to
CREATE EXTERNAL TABLE pasta (id int, artist string, song string, genres array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '|'
LOCATION 's3n://umami-dev/tmp2';

-- do the export
INSERT OVERWRITE TABLE pasta
SELECT * FROM sauce;
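
With those delimiters, each row exported to s3n://umami-dev/tmp2 comes out as tab-separated fields with the array elements joined by '|'. A row would look roughly like this (values made up for illustration):

1	Some Artist	Some Song	rock|indie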
