Incremental export from Hive to RDBMS

Question

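I am exploring Sqoop to send data from Hive to an RDBMS. I don't want to send the same data again and again. I need to identify the changes in HDFS and send only the data that has changed since the last export. What is the best way to implement this kind of incremental export logic? I see that sqoop import has an incremental option, but I don't see one for export.

Any suggestions would be appreciated.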

Answer 1

You can create a new table or view in Hive (TABLE_NAME_CHANGED) that holds only the changed records, and use it as the source for the load into the RDBMS.
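As a minimal sketch of this idea, the helper below prints HiveQL that rebuilds a `_CHANGED` table from rows newer than the last export. The function name `changed_rows_hql`, the column name `updated_at`, and the idea of tracking the last export timestamp yourself are assumptions for illustration; the answer does not say how changed records are identified.

```shell
# Hypothetical helper: prints HiveQL that rebuilds a table holding only the
# rows changed since the last export.
# Assumptions: the source table has a timestamp column, and the caller
# tracks the last exported timestamp (neither is specified in the answer).
changed_rows_hql() {
  src_table=$1; ts_column=$2; last_export=$3
  changed_table="${src_table}_CHANGED"
  printf 'DROP TABLE IF EXISTS %s;\n' "$changed_table"
  printf "CREATE TABLE %s AS SELECT * FROM %s WHERE %s > '%s';\n" \
    "$changed_table" "$src_table" "$ts_column" "$last_export"
}

# Example: generate the statements and pass them to beeline.
# changed_rows_hql TABLE_NAME updated_at '2018-03-23 14:33:38' > /tmp/changed.hql
# beeline -u "${ConnString}" -f /tmp/changed.hql
```

A table is used here rather than a view because `sqoop export` reads files from a directory, so the changed rows have to be materialized somewhere before they can be exported.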

Answer 2

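Assuming you have a timestamp field in Hive that identifies the delta, you can implement the incremental export as follows.

Before each export, check the max timestamp in the RDBMS and use it to create the export file.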

## Check the max timestamp in the RDBMS
# You may need to adjust this command (for example the awk line number) to match the output your sqoop version produces
mxdt=$(sqoop eval --connect 'jdbc:oracle:thin:@HOST:PORT/SSID' --username hadoop --password hadoop --query "select max(timestamp_field) from schema.table" | awk 'NR==6{print;exit}' | sed 's/|//g' | sed 's/[^[:print:]]//g' | sed 's/^ *//; s/ *$//')

# Based on the mxdt variable, create a file from beeline/hive as below (">" overwrites any file left over from a previous run)
beeline -u "${ConnString}" --outputformat=csv2 --showHeader=false --silent=true --nullemptystring=true --incremental=true -e "select * from hiveSchema.hiveTable where timestamp > '${mxdt}'" > /SomeLocalPath/FileName.csv

# Copy the file to HDFS (-f overwrites an existing copy)
hdfs dfs -put -f /SomeLocalPath/FileName.csv /tmp/

# Now use the file in HDFS to do the sqoop export
sqoop export --connect 'jdbc:oracle:thin:@HOST:PORT/SSID' --username hadoop --password hadoop --export-dir '/tmp/FileName.csv' --table RDBMSSCHEMA.RDBMSTABLE --fields-terminated-by "," --lines-terminated-by "\n" -m 1 --columns "col1,col2"
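
The `NR==6` step above depends on how many log lines sqoop prints before the result, which can vary between setups. A possible alternative, sketched under the assumption that `sqoop eval` prints result rows as `| value |` lines with the header row first, is to pick the second such line instead of a fixed line number. The helper name `extract_scalar` is hypothetical.

```shell
# Hypothetical helper: reads `sqoop eval` output on stdin and prints the
# single value of a one-row, one-column result.
# Assumption: result rows are printed as "| value |" lines, header first.
extract_scalar() {
  awk -F'|' '
    /^\|/ {
      rows++
      if (rows == 2) {                    # row 1 is the header, row 2 the value
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim padding, keep inner spaces
        print $2
        exit
      }
    }'
}

# Example (same query as above):
# mxdt=$(sqoop eval --connect 'jdbc:oracle:thin:@HOST:PORT/SSID' --username hadoop --password hadoop --query "select max(timestamp_field) from schema.table" | extract_scalar)
```

Only the padding around the value is trimmed, so a timestamp that contains a space between date and time stays intact.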

Answer 1
0 2018-03-23 14:33:38

Answer 2
0 2018-03-23 15:00:11
