Logstash and looking up additional data from a relational table?

I have mobile app log data being posted daily (eventually it will be a data stream). I am looking at different solutions for processing this log data and providing analytics. I am considering using a Logstash/Elasticsearch/Kibana combination, but we have additional data on our users stored in a Redshift database. So in addition to the mobile data, I would like to pull in additional data from Redshift about the user at the time of interaction with the mobile app.

However, I've read in some places that doing an actual database query through Logstash isn't feasible, but that you can use a dictionary file to do a lookup for each user.

I have two questions regarding this approach:

  1. Is there a limit to how large this lookup file can be? Mine would be < 500K records, so I'd imagine it would be fine?
  2. Can the process of making the lookup file from Redshift tables be fully automated (ideally through AWS services) - ie each night the lookup table is refreshed and posted to Logstash, and then used for breakouts in Kibana

The way we're currently doing it is processing a daily JSON file with a Lambda function, posting it to S3, and then reading it into a Redshift table. This data is then processed into sessions and joined with other tables to generate the final dataset to be used for visualization. This is currently done in Tableau but we are exploring other options (such as QuickSight, or possibly the ELK stack).

Just trying to figure out what solution is going to be scalable to clickstream data and will be the most useful down the line.

Thanks!

Logstash 7 has a jdbc_streaming filter plugin for dynamically adding data to your events, as well as the jdbc_static filter for static data.
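For reference, a minimal jdbc_streaming sketch might look something like the following, assuming a hypothetical users table keyed by user_id; the driver path, connection string, credentials, and column names are all placeholders you'd swap for your own:

    filter {
      jdbc_streaming {
        # Connection details are placeholders; Redshift speaks the PostgreSQL
        # wire protocol, so its JDBC driver (or a PostgreSQL one) works here.
        jdbc_driver_library => "/opt/logstash/drivers/redshift-jdbc42.jar"
        jdbc_driver_class => "com.amazon.redshift.jdbc42.Driver"
        jdbc_connection_string => "jdbc:redshift://my-cluster.example.com:5439/analytics"
        jdbc_user => "logstash"
        jdbc_password => "${REDSHIFT_PASSWORD}"
        # Bind the event's user_id field to the :id parameter in the query
        statement => "SELECT plan, signup_date FROM users WHERE user_id = :id"
        parameters => { "id" => "user_id" }
        # Matching rows are stored as an array under this field
        target => "user_info"
      }
    }

Each event's user_id is bound to the :id parameter, and the matching row(s) land under the user_info field, ready for breakouts in Kibana.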

As you found, you can also use the translate filter. The man page says they've tested "very large" datasets up to 100,000 entries, so your dataset may require some testing. The good part about this filter is that it will reload the data when it detects a change, so you can publish the data on your own schedule (eg cron) without restarting Logstash. Be on the lookout for events that don't get the translated value, which might be a sign that your publishing frequency should be updated.
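As a sketch, assuming your nightly Redshift export is published as a two-column CSV of user_id,plan pairs (the path and field names below are placeholders), the translate filter could look like:

    filter {
      translate {
        # In Logstash 7 these options are named field/destination
        field => "user_id"
        destination => "user_plan"
        # Two-column CSV (user_id,plan) republished nightly; path is a placeholder
        dictionary_path => "/etc/logstash/dictionaries/users.csv"
        # Seconds between checks for changes to the dictionary file
        refresh_interval => 300
        # Events whose user_id isn't in the dictionary get this value instead,
        # making stale dictionaries easy to spot in Kibana
        fallback => "unknown"
      }
    }

The fallback value gives you an easy way to filter in Kibana for events that missed the lookup, which covers the stale-dictionary concern above.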
