
Logstash-JDBC plugin: Importing millions of records from Oracle to ElasticSearch

Scenario:

  • I am trying to import a large data set (a few million records) from Oracle DB into ElasticSearch.
  • I am able to import the data using the Logstash JDBC input plugin.

Issue:

  • It takes a very long time (5 hours) to complete.

  • Is there a better solution to reduce this time? What are good practices in such scenarios?

You can experiment with jdbc_fetch_size first. The next step is to run the import on several nodes; for each node, split the SQL query into non-overlapping sets, e.g. select * from XX where id between 0 and 1000000 .
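One way to sketch that splitting step: generate one non-overlapping BETWEEN query per node. A minimal illustration, assuming a numeric id column and a known maximum id (the table name XX and column name are placeholders from the answer, not real schema):

```python
# Sketch: split an id range into non-overlapping SQL statements,
# one statement per Logstash node. Table/column names are illustrative.
def split_statements(table, id_column, max_id, num_nodes):
    """Return one BETWEEN query per node covering [0, max_id] with no overlap."""
    chunk = max_id // num_nodes + 1
    return [
        f"select * from {table} where {id_column} "
        f"between {i * chunk} and {(i + 1) * chunk - 1}"
        for i in range(num_nodes)
    ]

for stmt in split_statements("XX", "id", 3_999_999, 4):
    print(stmt)
```

Each generated statement then goes into the `statement` option of the jdbc input on its own node.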

You can also set the number of ES index replicas to 0 before indexing the big bulk, and change it back afterwards. And of course use the bulk indexing API with a good batch size and parallelism.
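To make those two suggestions concrete, here is a sketch (no network calls) of the JSON bodies involved: the settings change you would PUT to the index before importing, and a helper that builds an NDJSON body for the _bulk endpoint. The index name "oracle-import" and the document fields are illustrative assumptions:

```python
import json

# 1) Settings to apply before the big import (restore the originals after):
#    PUT /oracle-import/_settings
pre_import_settings = {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}

def bulk_body(index, docs, id_field="id"):
    """Build an NDJSON body for POST /_bulk from a batch of row dicts."""
    lines = []
    for doc in docs:
        # action line, then the document itself
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the _bulk API requires a trailing newline

batch = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(bulk_body("oracle-import", batch))
```

Using the source id as _id also makes re-runs idempotent: re-indexing the same row overwrites the existing document instead of duplicating it.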

You can use the

:sql_last_value

option to fetch data from the database incrementally with a scheduler. Something like the following will be helpful (assuming you have an ID field):

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => ".......\ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      statement => "select * from (select * from TABLE where id > :sql_last_value ORDER BY id ASC) where rownum < 20"
      use_column_value => true
      tracking_column => "id"
      tracking_column_type => "numeric"
      schedule => "* * * * *"
   }
}

Update: refactored the SQL to use rownum and ORDER BY so the results are limited in the intended order (sort before limiting). See: On ROWNUM and Limiting Results
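The sort-before-limit point can be illustrated without a database. Applying the row limit before the ORDER BY just takes an arbitrary slice of the table and then sorts it; sorting first and limiting afterwards returns the rows you actually want (the list of ids here is a stand-in for an unordered table scan):

```python
rows = [5, 3, 9, 1, 7]  # stand-in for ids returned by an unordered table scan

# ~ limiting with rownum first, then ordering: arbitrary rows, sorted
limit_then_sort = sorted(rows[:3])

# ~ ORDER BY inside the subquery, rownum applied outside: the intended rows
sort_then_limit = sorted(rows)[:3]

print(limit_then_sort)   # [3, 5, 9] - not the three smallest ids
print(sort_then_limit)   # [1, 3, 5] - the intended result
```

This is why the statement above wraps the ORDER BY query in a subquery and applies rownum to the outer query.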

The same result can be achieved using pagination, but it has performance issues. If we use pagination, e.g.:

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => ".......\ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      jdbc_page_size => 20
      jdbc_paging_enabled => true
      statement => "select * from TABLE"
  } 
}

But this will wrap

"select * from TABLE"

like

SELECT * FROM (SELECT "T1".*, ROWNUM "X_SEQUEL_ROW_NUMBER_X" FROM (SELECT * FROM (select * from TABLE) "T1") "T1") "T1" WHERE (("X_SEQUEL_ROW_NUMBER_X" > 0) AND ("X_SEQUEL_ROW_NUMBER_X" <= (0 + 20)));

and it will run without a scheduler, dividing the query based on the specified jdbc_page_size (20 in this example). But this method obviously has performance issues.
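A rough cost model shows why this ROWNUM/offset-style pagination degrades on big tables, while the keyset approach (id > :sql_last_value) does not: each offset page re-reads every row before its offset, so the total rows touched grow quadratically with table size, whereas keyset pagination reads each row once via the indexed id. A back-of-the-envelope sketch (the model ignores index lookups and is deliberately simplified):

```python
def offset_rows_scanned(total, page_size):
    """Total rows touched when every page re-scans from row 1 up to its limit."""
    pages = (total + page_size - 1) // page_size
    return sum(min(total, (p + 1) * page_size) for p in range(pages))

def keyset_rows_scanned(total, page_size):
    """Keyset pagination (id > :sql_last_value) reads each row once."""
    return total

print(offset_rows_scanned(1_000_000, 20))  # tens of billions of rows touched
print(keyset_rows_scanned(1_000_000, 20))  # 1000000
```

With a few million rows and a page size of 20, the difference between quadratic and linear is exactly the "5 hours" problem from the question.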

You can put more than one jdbc input plugin in the same config file. It works fine for me.

i.e.:

input {
   jdbc {
       ...
   }

   jdbc {
       ...
   }
}
