sqoop create impala parquet table
I'm relatively new to the process of sqooping, so pardon any ignorance. I have been trying to sqoop a table from a data source as a Parquet file and create an Impala table (also as Parquet) into which I will insert the sqooped data. The code runs without an issue, but when I try to select a couple of rows for testing I get the error:
.../EWT_CALL_PROF_DIM_SQOOP/ec2fe2b0-c9fa-4ef9-91f8-46cf0e12e272.parquet' has an incompatible Parquet schema for column 'dru_id.test_ewt_call_prof_dim_parquet.call_prof_sk_id'. Column type: INT, Parquet schema: optional byte_array CALL_PROF_SK_ID [i:0 d:1 r:0]
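One way to confirm what Sqoop actually wrote is to dump the file's schema with parquet-tools. This is only a sketch: the jar location and file name below are placeholders, not the actual paths from the job, and the command is echoed rather than executed.

```shell
# Hypothetical file name; substitute one of the .parquet files Sqoop produced.
PARQUET_FILE="/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP/part-m-00000.parquet"

# If CALL_PROF_SK_ID shows up as "optional binary" (byte_array) rather than
# int32/int64, the Oracle NUMBER column was exported as a string, which is
# exactly the mismatch the error complains about.
CHECK_CMD="hadoop jar parquet-tools.jar schema $PARQUET_FILE"
echo "$CHECK_CMD"   # drop the echo to run it on the cluster
```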
I was mirroring the process I found in a Cloudera guide here: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_create_table.html , mainly the "Internal and External Tables" section. I've been trying to avoid having to infer the schema from a particular Parquet file, since this whole thing will be kicked off every month with a bash script (and I also can't think of a way to point it to just one file if I use more than one mapper).

Here's the code I used. I feel like I'm either missing something small and stupid, or I've screwed up everything major without realizing it. Any and all help appreciated. Thanks!
sqoop import -Doraoop.import.hint=" " \
--options-file /home/kemri/pass.txt \
--verbose \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
--username [userid] \
--num-mappers 1 \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
--delete-target-dir \
--table DMPROD.EWT_CALL_PROF_DIM \
--direct \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile
impala-shell -k -i hrtimpslb.[employer].com
create external table test_EWT_CALL_PROF_DIM_parquet(
CALL_PROF_SK_ID INT,
SRC_SKL_CD_ID STRING,
SPLIT_NM STRING,
SPLIT_DESC STRING,
CLM_SYS_CD STRING,
CLM_SYS_NM STRING,
LOB_CD STRING,
LOB_NM STRING,
CAT_IND STRING,
CALL_TY_CD STRING,
CALL_TY_NM STRING,
CALL_DIR_CD STRING,
CALL_DIR_NM STRING,
LANG_CD STRING,
LANG_NM STRING,
K71_ATOMIC_TS TIMESTAMP)
stored as parquet location '/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP';
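The error suggests the Parquet files store CALL_PROF_SK_ID as byte_array (a string) while the DDL declares INT. One possible workaround, sketched here against the host and table names used above (untested), is to redeclare that column as STRING so the table matches the files, and cast it in queries instead. The echo keeps this a dry run:

```shell
# Hypothetical fix: make the external table's column type match what the
# Parquet files actually contain (byte_array maps to STRING in Impala).
QUERY="ALTER TABLE test_EWT_CALL_PROF_DIM_parquet CHANGE CALL_PROF_SK_ID CALL_PROF_SK_ID STRING"
echo impala-shell -k -i "hrtimpslb.[employer].com" -q "$QUERY"   # drop the echo to run it
```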
As per the request in the comments, I provide an example of how you could achieve the same using one sqoop import with --hive-import. For obvious reasons I haven't tested it for your specific requirements, so it could need some more tuning, which is often the case with these sqoop commands. In my experience, importing as Parquet forces you to use the --query option, since it doesn't allow you to use schema.table as the table.
sqoop import -Doraoop.import.hint=" " \
--verbose \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
--username [userid] \
-m 1 \
--password [ifNecessary] \
--hive-import \
--query 'SELECT * FROM DMPROD.EWT_CALL_PROF_DIM WHERE $CONDITIONS' \
--hive-database [database you want to use] \
--hive-table test_EWT_CALL_PROF_DIM_parquet \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile
Basically what you need for --hive-import is --hive-database, --hive-table and --query. If you don't want all your columns to appear in Hive as strings, you must also include:

--map-column-hive [column_name1=Timestamp,column_name2=Int,...]
You might need a similar --map-column-java as well, but I'm never sure when this is required. You will need a --split-by if you want multiple mappers.
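As a hedged sketch of those flags (Sqoop's user guide spells them --map-column-hive and --map-column-java; the column names below come from the DDL in the question, and -m 4 is an arbitrary choice), built as strings you could append to the import command:

```shell
# Type overrides so the key and timestamp columns don't land in Hive as strings.
TYPE_FLAGS="--map-column-hive CALL_PROF_SK_ID=INT,K71_ATOMIC_TS=TIMESTAMP --map-column-java CALL_PROF_SK_ID=Integer"

# Parallelism: --split-by is required once you go above one mapper.
SPLIT_FLAGS="--split-by CALL_PROF_SK_ID -m 4"

echo "$TYPE_FLAGS $SPLIT_FLAGS"
```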
As discussed in the comments, you will need to use invalidate metadata db.table to make sure Impala sees these changes. You could issue both commands from the CL, or use a single bash script in which you can issue the impala command using impala-shell -q [query].
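A minimal sketch of such a monthly bash script, reusing the placeholders from the commands above (untested against a real cluster). The run helper prints each command while DRY_RUN=1; set DRY_RUN=0 to actually execute them:

```shell
#!/bin/sh
# Dry-run wrapper: chains the sqoop import and the Impala metadata refresh.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

run sqoop import -Doraoop.import.hint=" " \
  --options-file /home/kemri/pass.txt \
  --connect "jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000" \
  --username "[userid]" -m 1 \
  --hive-import \
  --query 'SELECT * FROM DMPROD.EWT_CALL_PROF_DIM WHERE $CONDITIONS' \
  --hive-database "[db]" --hive-table test_EWT_CALL_PROF_DIM_parquet \
  --target-dir "hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP" \
  --null-string '\\N' --null-non-string '\\N' \
  --as-parquetfile

# Refresh Impala's view of the table after the load.
run impala-shell -k -i "hrtimpslb.[employer].com" \
  -q "invalidate metadata [db].test_EWT_CALL_PROF_DIM_parquet"
```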