
Can Apache Storm be used to process tuples with a dynamic set of properties?

I am currently evaluating Apache Storm to process heterogeneous data from multiple data sources. While there may be some common properties shared by all data (e.g., a "type" property), I would like to be able to support many different "classes" of tuples and also to handle new data types with minimal changes to the topology. To give an example of what these data types might look like:

{type=LogTransaction,timestamp=...,user=...,duration=...}
{type=LogEvent,timestamp=...,user=...,message=...}

The examples on the Storm page primarily deal with simple tuples that are well defined in advance, so that the spouts and bolts can statically declare their output fields.

My initial idea was to declare the type and store all other properties in a Map<String,Object>, which could then be declared as follows:

public void declareOutputFields(OutputFieldsDeclarer ofd) {
    ofd.declare(new Fields("type", "attributes"));
}
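For illustration, the split into a type plus an attribute map could look like the following in plain Java. This is a minimal sketch with no Storm dependency; the class name `DynamicTuple` and the helper `toValues` are hypothetical, and the returned list only stands in for the values a spout or bolt would emit for the two declared fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DynamicTuple {
    // Splits a flat record into the two declared fields: [type, attributes].
    // Hypothetical helper; the input map is copied and left unmodified.
    static List<Object> toValues(Map<String, Object> record) {
        Map<String, Object> attributes = new LinkedHashMap<>(record);
        Object type = attributes.remove("type"); // everything else stays in the map
        List<Object> values = new ArrayList<>();
        values.add(type);
        values.add(attributes);
        return values;
    }

    public static void main(String[] args) {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("type", "LogEvent");
        event.put("user", "alice");
        event.put("message", "login");
        System.out.println(toValues(event)); // prints "[LogEvent, {user=alice, message=login}]"
    }
}
```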

However, I believe that at that point many of the more advanced features of Storm will no longer work correctly. For example, it is my understanding that I could no longer use Trident to execute a groupBy on any of the attributes.
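One possible way around the grouping concern, offered here as an assumption rather than an established Storm pattern, is to copy the attribute that should act as the grouping key out of the map into a field of its own before grouping. The sketch below shows only that extraction step in plain Java; `PromoteKey` and `promote` are hypothetical names, and the resulting values would correspond to declared fields such as "type", "user" and "attributes".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PromoteKey {
    // Returns [type, groupKey, attributes], copying one attribute to the top level.
    // A missing attribute yields null, which the caller must decide how to handle.
    static List<Object> promote(String type, Map<String, Object> attributes, String key) {
        return Arrays.asList(type, attributes.get(key), attributes);
    }

    public static void main(String[] args) {
        Map<String, Object> attrs = new HashMap<>();
        attrs.put("user", "alice");
        attrs.put("duration", 120);
        List<Object> values = promote("LogTransaction", attrs, "user");
        System.out.println(values.get(1)); // prints "alice"
    }
}
```

The trade-off is that every attribute used for grouping has to be known to the topology, while all other attributes can stay dynamic inside the map.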

Is there a better way to handle this type of data in Apache Storm that I have missed? I did find this post describing a similar issue; however, I would like to avoid having to create a Java class for each data type.

You can use your own customized fields as long as each field is serializable. It will work fine in Storm with more than one supervisor.

Storm is a distributed data processing tool, so when there is more than one supervisor, certain bolts will, depending on the grouping, emit their fields to bolts running on a different supervisor. In such situations, the output fields are serialized and sent over the network. This serialization can be either regular Java serialization or Kryo serialization (to reduce network latency).

Hence, you might experience exceptions if your JVM is not able to serialize your output fields.
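A quick way to catch such problems before deploying is to try serializing a sample value locally. The sketch below is a minimal check using only standard Java serialization; the class `SerializationCheck` and the method `isJavaSerializable` are hypothetical names, and the check says nothing about Kryo, which has its own rules for which classes it can handle.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

public class SerializationCheck {
    // Returns true if the value can be written with standard Java serialization.
    static boolean isJavaSerializable(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (IOException e) { // includes NotSerializableException
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> good = new HashMap<>();
        good.put("user", "alice");
        Map<String, Object> bad = new HashMap<>();
        bad.put("handle", new Object()); // java.lang.Object is not Serializable
        System.out.println(isJavaSerializable(good)); // prints "true"
        System.out.println(isJavaSerializable(bad));  // prints "false"
    }
}
```

With a Map<String,Object> field, the map itself is serializable, but a single non-serializable value inside it is enough to make the whole tuple fail.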
