Logstash - csv output headers

I'm trying to query a database with the logstash jdbc input plugin and produce a csv output file with headers using the logstash csv output plugin.

I've spent a lot of time in the logstash documentation, but I'm still missing something.

With the following logstash configuration, the result is a file that repeats the headers before every row. I couldn't find a way in the logstash configuration to write the headers only once, on the first row.

Help very much appreciated.

Output file:

_object$id;_object$name;_object$type;nb_surveys;csat_score
2;Jeff Karas;Agent;2;2  
_object$id;_object$name;_object$type;nb_surveys;csat_score
3;John Lafer;Agent;2;2
_object$id;_object$name;_object$type;nb_surveys;csat_score
4;Michele Fisher;Agent;2;2
_object$id;_object$name;_object$type;nb_surveys;csat_score
5;Chad Hendren;Agent;2;78

file: simple-out.conf

input {
    jdbc {
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
        jdbc_user => "postgres"
        jdbc_password => "postgres"
        jdbc_driver_library => "/tmp/drivers/postgresql/postgresql_jdbc.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        statement_filepath => "query.sql"
    }
}
output {
    csv {
        fields => ["_object$id","_object$name","_object$type","nb_surveys","csat_score"]
        path => "output/%{team}/output-%{team}.%{+yyyy.MM.dd}.csv"
        csv_options => {
            "write_headers" => true
            "headers" => ["_object$id","_object$name","_object$type","nb_surveys","csat_score"]
            "col_sep" => ";"
        }
    }
}

Thanks

The reason you are getting multiple headers in the output is that Logstash has no concept of global/shared state between events; each event is handled in isolation, so every time the CSV output plugin runs it behaves like the first run and writes the headers again.

I had the same issue and found a solution using the init option of the ruby filter to execute some code at Logstash startup time.

Here is an example logstash config:

# csv-headers.conf

input {
    stdin {}
}
filter {
    ruby {
        # init runs once at pipeline startup: write the header row
        # only if the output file is missing or empty.
        init => "
            require 'csv'
            @@csv_file    = 'output.csv'
            @@csv_headers = ['A','B','C']
            if !File.exist?(@@csv_file) || File.zero?(@@csv_file)
                CSV.open(@@csv_file, 'w') do |csv|
                    csv << @@csv_headers
                end
            end
        "
        # code runs per event: expose the file name through @metadata
        # so the csv output below can build its path from it.
        # (Hash-style event access is the legacy Logstash 2.x API;
        # see the note after this config for newer versions.)
        code => "
            event['@metadata']['csv_file']    = @@csv_file
            event['@metadata']['csv_headers'] = @@csv_headers
        "
    }
    csv {
        columns => ["a", "b", "c"]
    }
}
output {
    csv {
        fields => ["a", "b", "c"]
        path   => "%{[@metadata][csv_file]}"
    }
    stdout {
        codec => rubydebug {
            metadata => true
        }
    }
}
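
Note that the code block above uses hash-style event access, which was the Ruby filter API at the time this answer was written (Logstash 2.x). On Logstash 5.x and later the event has to be read and written through getters and setters instead, so the per-event block would become roughly the following (a sketch of the same logic, not tested on every version):

        code => "
            event.set('[@metadata][csv_file]', @@csv_file)
            event.set('[@metadata][csv_headers]', @@csv_headers)
        "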

If you run Logstash with that config:

echo "1,2,3\n4,5,6\n7,8,9" | ./bin/logstash -f csv-headers.conf

You will get an output.csv file with this content:

A,B,C
1,2,3
4,5,6
7,8,9

This is also thread-safe, because the header-writing code runs only once at startup, so you can use multiple workers.
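
For example, to run the pipeline above with several workers (the -w flag sets the pipeline worker count; the binary path assumes the same layout as the earlier command):

./bin/logstash -f csv-headers.conf -w 4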

Hope it helps!

I am using dynamic file names based on the date of the event (index-YYYY-MM-DD.csv), so writing the headers at pipeline start was not a viable option for me.

Instead, I allowed the duplicate headers to be written and set up a cron job that runs every few minutes, removes all duplicate rows, and writes the result back into the same file.

#!/bin/bash -xe
# Keep only the first occurrence of each row key ($1); the repeated
# header rows collapse into a single header per file.
for filename in /tmp/logstash/*.csv; do awk '!v[$1]++' "$filename" > "$filename.tmp" && mv -f "$filename.tmp" "$filename"; done
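
For reference, a crontab entry that runs this cleanup every five minutes could look like the following (the script path /usr/local/bin/dedupe_csv.sh is a placeholder for wherever you save the script above):

*/5 * * * * /usr/local/bin/dedupe_csv.sh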

NOTE: This is only tested on an instance where I am pulling a couple hundred MB of data; it may not be a viable option if your data pipeline is ingesting GB per minute.
