
Import from MySQL to Hive using Sqoop

I have to import more than 400 million rows from a MySQL table (which has a composite primary key) into a PARTITIONED Hive table via Sqoop. The table holds two years of data, with a departure date column ranging from 20120605 to 20140605 and thousands of records per day. I need to partition the data based on the departure date.

The versions:

Apache Hadoop - 1.0.4

Apache Hive - 0.9.0

Apache Sqoop - sqoop-1.4.2.bin__hadoop-1.0.0

As per my knowledge, there are 3 approaches:

  1. MySQL -> Non-partitioned Hive table -> INSERT from the non-partitioned Hive table into the partitioned Hive table
  2. MySQL -> Partitioned Hive table
  3. MySQL -> Non-partitioned Hive table -> ALTER the non-partitioned Hive table to add PARTITION

    1. is the painful one that I'm currently following (a HiveQL sketch of this route appears after this list)

    2. I read that support for this was added in later(?) versions of Hive and Sqoop, but I was unable to find an example

    3. The syntax requires specifying partitions as key-value pairs, which is not feasible with millions of records where one cannot enumerate all the partition key-value pairs
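For reference, the second hop of approach 1 is usually a dynamic-partition INSERT in HiveQL; a minimal sketch, assuming hypothetical tables flights_staging (filled by a plain Sqoop import) and flights_partitioned (partitioned by departure_date), could look like:

-- Enable dynamic partitioning so Hive derives the partition from the data
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- departure_date must be the last column in the SELECT so Hive can use it
-- as the dynamic partition value
INSERT OVERWRITE TABLE flights_partitioned PARTITION (departure_date)
SELECT flight_id, origin, destination, departure_date
FROM flights_staging;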

Can anyone provide inputs for approaches 2 and 3?

I guess you can create a Hive partitioned table.
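A minimal sketch of such a table, assuming illustrative column names and the departure date as the (string) partition column, might be:

-- Column names here are illustrative; the partition column is declared
-- separately from the regular columns
CREATE TABLE flights (
  flight_id   BIGINT,
  origin      STRING,
  destination STRING
)
PARTITIONED BY (departure_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';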

Then write the Sqoop import code for it.

For example:

sqoop import --hive-overwrite --hive-drop-import-delims --warehouse-dir "/warehouse" \
  --hive-table <hive table> --connect jdbc:<mysql path>/DATABASE=xxxx \
  --table <mysql table> --username xxxx --password xxxx --num-mappers 1 \
  --hive-partition-key <partition column> --hive-partition-value <partition value> \
  --hive-import --fields-terminated-by ',' --lines-terminated-by '\n'

You have to create the partitioned table structure first, before you move your data into the partitioned table. When running Sqoop, there is no need to specify --hive-partition-key and --hive-partition-value; use --hcatalog-table instead of --hive-table.
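For example, a minimal sketch using Sqoop's HCatalog integration (available in Sqoop releases newer than the 1.4.2 listed above; host, database, and table names are placeholders) might look like:

sqoop import --connect jdbc:mysql://<host>/<database> --username xxxx --password xxxx \
  --table <mysql table> --num-mappers 4 \
  --hcatalog-database default --hcatalog-table <hive table>

With --hcatalog-table the rows are written through HCatalog into the pre-created partitioned table, so each row lands in the partition matching its departure_date value rather than in a single --hive-partition-value.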

Manu

If this is still something people want to understand, they can use

sqoop import --driver <driver name> --connect <connection url> --username <user name> -P --table employee  --num-mappers <numeral> --warehouse-dir <hdfs dir> --hive-import --hive-table table_name --hive-partition-key departure_date --hive-partition-value $departure_date

Notes from the patch:

sqoop import [all other normal command line options] --hive-partition-key ds --hive-partition-value "value"

Some limitations:

  • It only allows for one partition key/value per run (see the sketch after this list for one way to work around this)
  • The type of the partition key is hardcoded to be a string
  • With auto partitioning in Hive 0.7 we may want to adjust this to just have one command line option for the key name and use that column from the db table to partition
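Given those limitations, one workaround for approach 2 is to drive Sqoop from a small shell loop, importing one departure date per run so that each run fills exactly one partition; the connection details, table name, and date list below are placeholders:

#!/bin/bash
# One Sqoop run per departure date; each run loads exactly one Hive partition.
# In practice the date list would be generated for the whole 20120605-20140605 range.
for departure_date in 20120605 20120606 20120607; do
  sqoop import \
    --connect jdbc:mysql://<host>/<database> --username xxxx --password xxxx \
    --table <mysql table> --num-mappers 4 \
    --where "departure_date = ${departure_date}" \
    --hive-import --hive-table <hive table> \
    --hive-partition-key departure_date \
    --hive-partition-value "${departure_date}"
done

If departure_date is also a regular column of the source table, it may need to be excluded with --columns so it does not clash with the Hive partition column of the same name.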
