
sqoop create impala parquet table

I'm relatively new to the process of sqooping, so pardon any ignorance. I have been trying to sqoop a table from a data source as a Parquet file and create an Impala table (also stored as Parquet) into which I will insert the sqooped data. The code runs without an issue, but when I try to select a couple of rows for testing I get the error:

.../EWT_CALL_PROF_DIM_SQOOP/ec2fe2b0-c9fa-4ef9-91f8-46cf0e12e272.parquet' has an incompatible Parquet schema for column 'dru_id.test_ewt_call_prof_dim_parquet.call_prof_sk_id'. Column type: INT, Parquet schema: optional byte_array CALL_PROF_SK_ID [i:0 d:1 r:0]

I was mirroring the process I found in a Cloudera guide here: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_create_table.html, mainly the "Internal and External Tables" section. I've been trying to avoid having to infer the schema from a particular Parquet file, since this whole thing will be kicked off every month with a bash script (and I also can't think of a way to point it to just one file if I use more than one mapper).
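
For reference, the schema-inference approach from that guide (the one I'm trying to avoid) looks roughly like the sketch below; the specific data file name is just a placeholder, since in practice I'd have to point it at one of the files sqoop wrote out:

create external table test_EWT_CALL_PROF_DIM_parquet
like parquet '/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP/[one-of-the-sqooped-files].parquet'
stored as parquet
location '/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP';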

Here's the code I used. I feel like I'm either missing something small and stupid, or I've screwed up everything major without realizing it. Any and all help appreciated. Thanks!

    sqoop import -Doraoop.import.hint=" " \
    --options-file /home/kemri/pass.txt \
    --verbose \
    --connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
    --username [userid] \
    --num-mappers 1 \
    --target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
    --delete-target-dir \
    --table DMPROD.EWT_CALL_PROF_DIM \
    --direct \
    --null-string '\\N' \
    --null-non-string '\\N' \
    --as-parquetfile 


impala-shell -k -i hrtimpslb.[employer].com


create external table test_EWT_CALL_PROF_DIM_parquet(
CALL_PROF_SK_ID INT,
SRC_SKL_CD_ID STRING,
SPLIT_NM STRING,
SPLIT_DESC STRING,
CLM_SYS_CD STRING,
CLM_SYS_NM STRING,
LOB_CD STRING,
LOB_NM STRING,
CAT_IND STRING,
CALL_TY_CD STRING,
CALL_TY_NM STRING,
CALL_DIR_CD STRING,
CALL_DIR_NM STRING,
LANG_CD STRING,
LANG_NM STRING,
K71_ATOMIC_TS TIMESTAMP)
stored as parquet location '/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP';

As per the request in the comments, here is an example of how you could achieve the same thing using one sqoop import with --hive-import. For obvious reasons I haven't tested it against your specific requirements, so it could need some more tuning, which is often the case with these sqoop commands. In my experience, importing as Parquet forces you to use the --query option, since it doesn't allow you to use schema.table as the table.

sqoop import -Doraoop.import.hint=" "\
--verbose \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
--username [userid] \
-m 1 \
--password [ifNecessary] \
--hive-import \
--query 'SELECT * FROM DMPROD.EWT_CALL_PROF_DIM WHERE $CONDITIONS' \
--hive-database [database you want to use] \
--hive-table test_EWT_CALL_PROF_DIM_parquet \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile

Basically, what you need for --hive-import is --hive-database, --hive-table and --query. If you don't want all your columns to appear in Hive as strings, you must also include:

--map-column-hive [column_name1=Timestamp,column_name2=Int,...]

You might need a similar --map-column-java as well, but I'm never sure when this is required. You will need a --split-by if you want multiple mappers.
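
For example (untested, just a sketch based on the command above), a multi-mapper variant could look something like this, assuming CALL_PROF_SK_ID is an evenly distributed numeric key you can split on:

sqoop import -Doraoop.import.hint=" " \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
--username [userid] \
--query 'SELECT * FROM DMPROD.EWT_CALL_PROF_DIM WHERE $CONDITIONS' \
--split-by CALL_PROF_SK_ID \
-m 4 \
--hive-import \
--hive-database [database you want to use] \
--hive-table test_EWT_CALL_PROF_DIM_parquet \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile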

As discussed in the comments, you will need to use invalidate metadata db.table to make sure Impala sees these changes. You could issue both commands from the CLI, or use a single bash script in which you issue the Impala command with impala-shell -q [query].
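
A minimal sketch of that last step from a bash script, reusing the host from your example, could look like this:

impala-shell -k -i hrtimpslb.[employer].com \
-q "invalidate metadata [database you want to use].test_EWT_CALL_PROF_DIM_parquet"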
