
Sqoop export duplicates

Will Sqoop export create duplicates when the number of mappers is higher than the number of blocks in the source HDFS location?

My source HDFS directory has 24 million records, and when I do a Sqoop export to a Postgres table, it somehow creates duplicate records. I have set the number of mappers to 24. There are 12 blocks in the source location.

Any idea why Sqoop is creating duplicates?

  • Sqoop version: 1.4.5.2.2.9.2-1
  • Hadoop version: Hadoop 2.6.0.2.2.9.2-1

Sqoop command used:

sqoop export -Dmapred.job.queue.name=queuename \
--connect jdbc:postgresql://ServerName/database_name \
--username USER --password PWD \
--table Tablename \
--input-fields-terminated-by "\001" --input-null-string "\\\\N" --input-null-non-string "\\\\N" \
--num-mappers 24 -m 24 \
--export-dir $3/penet_baseline.txt -- --schema public;

No, Sqoop does not export records twice, and this has nothing to do with the number of mappers or the number of blocks.


Look at the pg_bulkload connector for Sqoop for faster data transfer between HDFS and Postgres.

The pg_bulkload connector is a direct connector for exporting data into PostgreSQL. This connector uses pg_bulkload. Users benefit from functionality of pg_bulkload such as fast exports that bypass shared buffers and WAL, flexible error record handling, and ETL features with filter functions. By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.
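A minimal sketch of what an export through the pg_bulkload connector might look like, assuming pg_bulkload is installed at /usr/local/bin/pg_bulkload on every worker node and that your Sqoop build ships org.apache.sqoop.manager.PGBulkloadManager (check the Sqoop user guide for your exact version; the paths and property names below are illustrative):

sqoop export \
-Dpgbulkload.bin="/usr/local/bin/pg_bulkload" \
-Dpgbulkload.input.field.delim=$'\001' \
--connect jdbc:postgresql://ServerName/database_name \
--username USER --password PWD \
--connection-manager org.apache.sqoop.manager.PGBulkloadManager \
--table Tablename \
--export-dir $3/penet_baseline.txt \
-m 24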

bagavathi, you mentioned that duplicate rows were seen in the target table, that adding a PK constraint failed due to a PK violation, and that the source does not have duplicate rows. One possible scenario is that your target table already contained records, perhaps left over from a previous incomplete Sqoop job. Please check whether the target table has keys that are also present in the source.
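A quick way to inspect what is already in the target is to look for keys that now occur more than once. A minimal sketch using psql, assuming a hypothetical key column id (substitute your actual key column and connection details):

psql -h ServerName -U USER -d database_name \
  -c "SELECT id, COUNT(*) AS cnt FROM Tablename GROUP BY id HAVING COUNT(*) > 1 LIMIT 20;"

Any rows returned are keys that were written more than once, which points to pre-existing data or a re-run export rather than Sqoop duplicating records on its own.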

One workaround for this scenario is to use the parameter "--update-mode allowinsert". In your command, add the parameters --update-key <key column> --update-mode allowinsert. This ensures that if the key is already present in the table the record is updated, and if the key is not present Sqoop does an insert.
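A sketch of the original command with the upsert parameters added, assuming a hypothetical key column id (replace it with your real primary key); note that --update-mode allowinsert is not supported by every connector, so verify it against your Sqoop version:

sqoop export -Dmapred.job.queue.name=queuename \
--connect jdbc:postgresql://ServerName/database_name \
--username USER --password PWD \
--table Tablename \
--update-key id --update-mode allowinsert \
--input-fields-terminated-by "\001" --input-null-string "\\\\N" --input-null-non-string "\\\\N" \
-m 24 \
--export-dir $3/penet_baseline.txt -- --schema public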

If you have used Sqoop incremental mode, there may be duplicate records on HDFS. Before running the export to Postgres, collect only the latest record per key (based on the max of a date or timestamp column) into one table and then do the export. I think that should work.
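A minimal sketch of that deduplication step on the Hive side, assuming a hypothetical source table source_table with a key column id, data columns col1 and col2, and a timestamp column updated_at (all names are placeholders for your actual schema):

hive -e "
CREATE TABLE source_table_dedup AS
SELECT id, col1, col2, updated_at
FROM (
  SELECT id, col1, col2, updated_at,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM source_table
) t
WHERE t.rn = 1;
"

You could then point --export-dir at the HDFS directory backing source_table_dedup instead of the original path.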
