
How to transfer RDBMS data with multiple rows per primary key into Elasticsearch

I'd like to transfer data from an RDBMS to Elasticsearch.

From MySQL data

+---------+-----------+
| col1    | col2      |
+---------+-----------+
| a       |      abc  |
| a       |      def  |
| a       |      ghi  |
| a       |      jkl  |
| a       |      mno  |
| b       |      pqr  |
| b       |      stu  |
| b       |      vwx  |
+---------+-----------+

to Elasticsearch Data

{
  "col1" : "a",
  "col2" : ["abc", "def", "ghi", "jkl", "mno"]
}

{
  "col1" : "b",
  "col2" : ["pqr", "stu", "vwx"]
}

I'd like to use col1 as the '_id'.
Is this possible with Logstash or any other transfer tool?

You can use exactly what you want as the _id in Elasticsearch. Logstash lets you define it easily in the elasticsearch output with the document_id parameter.
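For instance, a minimal sketch of such an output section (the hosts value and the index name here are assumptions, not from the question; adjust them to your cluster):

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]  # assumed local Elasticsearch node
    index => "my_index"                 # hypothetical index name
    document_id => "%{col1}"            # use the value of the col1 field as the document _id
  }
}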

Thanks to Jaycreation and some googling, I found out how to import the data from MySQL to Elasticsearch. First of all, the conf file looks like this.

input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java-8.0.21.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    # host and port are placeholders; a MySQL URL needs the jdbc:mysql:// prefix
    jdbc_connection_string => "jdbc:mysql://host:port/dev_new_areyousick_algorithm?verifyServerCertificate=false&useSSL=false&serverTimezone=UTC"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_paging_enabled => true
    # the tracking settings belong to the commented-out incremental statement below
    tracking_column => "unix_ts_in_secs"
    use_column_value => true
    tracking_column_type => "numeric"
    # run the query every 10 seconds
    schedule => "*/10 * * * * *"
    #statement => "SELECT *, UNIX_TIMESTAMP(update_dt) AS unix_ts_in_secs FROM tb_hospinfo_review WHERE update_dt < NOW() ORDER BY update_dt ASC"
    statement => "SELECT * FROM table_name"
  }
}
filter {
  mutate {
    # keep a copy of col1 in @metadata (not indexed); the output below uses %{col1} directly
    copy => { "col1" => "[@metadata][_id]" }
    remove_field => ["@version", "@timestamp"]
  }
  aggregate {
    # group all rows that share the same col1 value
    task_id => "%{col1}"
    code => "
      map['col1'] ||= event.get('col1')
      map['col2s'] ||= []
      # this wraps each value in a hash; append event.get('col2') directly for a plain array
      map['col2s'] << {'col2' => event.get('col2')}
      event.cancel()
    "
    # emit the aggregated map as a new event once a row with a different col1 arrives
    push_previous_map_as_event => true
    timeout => 3
  }
}
output {
  # stdout { codec => "rubydebug" }
  elasticsearch {
    index => "keyword_table"
    document_id => "%{col1}"
  }
}

A more formal example is shown at https://www.elastic.co/guide/en/logstash/current/plugins-filters-aggregate.html#plugins-filters-aggregate-example4
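One caveat with push_previous_map_as_event: it flushes the in-progress map as soon as a row with a different task_id (col1 here) arrives, so the query should return all rows for the same col1 consecutively. A sketch of a statement that guarantees this, reusing the placeholder table name from the config above:

statement => "SELECT col1, col2 FROM table_name ORDER BY col1 ASC"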

And the most important thing (in my case) is how to execute it on the CLI.

At first, I ran it like this:

sudo ./logstash --path.settings /etc/logstash -f ./logstash-conf.conf

but only some of the data was imported.

Then I realized that Logstash processes events in parallel across multiple pipeline workers, and the aggregate filter only works correctly with a single worker, since all events sharing a task_id must be processed in order by the same thread. So I added -w 1 at the end of the CLI line and it works!!
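The full command then becomes:

sudo ./logstash --path.settings /etc/logstash -f ./logstash-conf.conf -w 1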
