

How to modify messages from the Twitter API saved in one Kafka topic and send them to another Kafka topic

I have created a Kafka producer that uses hbc-core to publish messages from the Twitter API to one topic, and I want to transform those messages, because I only need a few fields: the tweet creation time, the id string, some basic information about the user, and the tweet text. I tried to use Kafka Streams with a POJO model, but I had trouble extracting the text, because the full text may live in differently named fields depending on whether the tweet was retweeted, exceeded 140 characters, etc. My POJO model:

{
  "type": "object",
  "properties": {
    "created_at": { "type": "string" },
    "id_str": { "type": "string" },
    "user": {
      "type": "object",
      "properties": {
        "location": { "type": "string" },
        "followers_count": { "type": "integer" },
        "friends_count": { "type": "integer" },
        "created_at": { "type": "string" }
      }
    },
    "text": { "type": "string" }
  }
}
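The varying location of the text is usually easier to handle with a small fallback chain than with a rigid POJO. Below is a sketch of that idea, assuming Jackson is on the classpath and that the payloads follow the classic Twitter API shape (`retweeted_status`, `extended_tweet.full_text`, `text`); adapt the field names to whatever your payloads actually contain:

```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TweetTextExtractor {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Picks the fullest text available, checking the places the classic
     *  Twitter API may put it (retweets, tweets over 140 characters). */
    public static String extractText(String rawJson) {
        try {
            JsonNode tweet = MAPPER.readTree(rawJson);
            // For retweets, the original text lives under retweeted_status.
            JsonNode source = tweet.path("retweeted_status").isMissingNode()
                    ? tweet
                    : tweet.path("retweeted_status");
            // Tweets over 140 characters carry the untruncated text under
            // extended_tweet.full_text.
            JsonNode fullText = source.path("extended_tweet").path("full_text");
            return fullText.isMissingNode()
                    ? source.path("text").asText()
                    : fullText.asText();
        } catch (JsonProcessingException e) {
            throw new IllegalArgumentException("Not a valid tweet payload", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractText("{\"text\":\"hello\"}")); // hello
        System.out.println(extractText(
            "{\"text\":\"hel\",\"extended_tweet\":{\"full_text\":\"hello world\"}}")); // hello world
    }
}
```

Using `path(...)` instead of `get(...)` means a missing field yields a `MissingNode` rather than a `NullPointerException`, which keeps the fallback chain readable.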

Is that the right way to use Kafka Streams, or is there a better solution to extract those fields and put them into another topic?

Without any intermediate clients, systems, fancy Kafka Streams utilities, or miracle frameworks, focus on the old, simple way: reuse that producer and the code that already generates and sends the full POJOs.

Producers are thread-safe, so it's completely fine to use the same producer instance to produce to two or more topics from different threads.

This is just a simplification, as I can't know the details of your implementation. I'll assume the message (POJO) is a simple String. Using some imagination, treat the different letters as fields. From the fullPojo, you'd like to send a message containing just two fields, represented as y and v, to another topic.

  String fullPojo = "xxxxxyxxxxv";
  //some logic to extract the desired fields
  String shortPojo = getDesiredFields(fullPojo);
  /* shortPojo="yv" */
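Continuing the toy example, a minimal getDesiredFields could look like this (pure character filtering; in real code this is where your JSON field extraction would go):

```java
public class PojoFilter {

    // Toy stand-in for the real extraction logic: keep only the
    // characters that represent the two desired fields, y and v.
    static String getDesiredFields(String fullPojo) {
        StringBuilder shortPojo = new StringBuilder();
        for (char c : fullPojo.toCharArray()) {
            if (c == 'y' || c == 'v') {
                shortPojo.append(c);
            }
        }
        return shortPojo.toString();
    }

    public static void main(String[] args) {
        System.out.println(getDesiredFields("xxxxxyxxxxv")); // yv
    }
}
```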

Create a new topic on your Kafka cluster; for this example it will be called shortPojoTopic.

Just use the same producer that sends the full data to your original topic, making a second call in order to fill the short topic with the message containing only the filtered values:

producer.send(new ProducerRecord<String, String>(fullPojoTopic,  fullPojo));
producer.send(new ProducerRecord<String, String>(shortPojoTopic, shortPojo));

This second call could also be made from a secondary thread. If you wish to accomplish multithreading here, you could define a second thread that does the "filtering" job. Just pass the original producer reference to this second thread and link both threads with something like a FIFO structure (a deque, a queue, ...) that holds the fullPojos.

  • The original thread sends the fullPojo to the fullPojoTopic topic and pushes the fullPojo into the queue.
  • The secondary "filterer" thread removes the head message from the queue/deque, extracts the desired fields to create the shortPojo, and sends it to the shortPojoTopic (using the same producer, without worrying about producer synchronization issues).
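The two-thread handoff described above can be sketched with a bounded BlockingQueue from the JDK. To keep the sketch self-contained, the producer.send calls are replaced with in-memory lists; in real code both threads would share the same KafkaProducer instance:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class TwoThreadHandoff {

    // Bounded queue: caps memory use if the filterer falls behind.
    static final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);

    // In-memory stand-ins for producer.send(new ProducerRecord<>(topic, msg)).
    static final List<String> fullPojoTopic  = new CopyOnWriteArrayList<>();
    static final List<String> shortPojoTopic = new CopyOnWriteArrayList<>();

    // Toy filtering: keep only the y and v "fields".
    static String getDesiredFields(String fullPojo) {
        return fullPojo.replaceAll("[^yv]", "");
    }

    public static void main(String[] args) {
        List<String> incoming = List.of("xxyxv", "yxxv", "xvyx");

        // Secondary thread: take a fullPojo, filter it, "send" the shortPojo.
        Thread filterer = new Thread(() -> {
            try {
                for (int i = 0; i < incoming.size(); i++) {
                    String fullPojo = queue.take(); // blocks until a message arrives
                    shortPojoTopic.add(getDesiredFields(fullPojo));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        filterer.start();

        try {
            // Original thread: "send" each fullPojo, then hand it to the filterer.
            for (String fullPojo : incoming) {
                fullPojoTopic.add(fullPojo);
                queue.put(fullPojo);
            }
            filterer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println(shortPojoTopic); // [yv, yv, vy]
    }
}
```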

The second thread also avoids locking the entire system if one of the topics is in a bad state and can't accept more messages, if one of the topics is located on a different Kafka cluster that just failed (in that case you would need two different producers as well), or even if the filtering process runs into difficulties with certain malformed messages. For example, even if the shortPojoTopic is down, that won't affect the first thread's performance, as it will continue sending its fullPojos without issues or delays.

Always beware of memory use: the queue/deque size should be limited or controlled in some way to avoid an OOM if the second thread gets stuck for a long time, or if it can't keep up with the rhythm of the first thread. If that happens, it won't be able to read/remove the messages fast enough, generating a lag that could lead to the mentioned OOM problem.
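One way to keep that bound is a fixed-capacity queue plus a timed offer, so the first thread detects when the filterer is lagging instead of blocking forever. A minimal sketch (the policy on rejection is up to you):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedHandoff {

    public static void main(String[] args) throws InterruptedException {
        // Capacity 2: the third message will not fit.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);

        System.out.println(queue.offer("pojo-1")); // true
        System.out.println(queue.offer("pojo-2")); // true

        // Queue full: a timed offer gives up after the timeout instead of
        // blocking the producing thread forever.
        boolean accepted = queue.offer("pojo-3", 50, TimeUnit.MILLISECONDS);
        System.out.println(accepted); // false

        if (!accepted) {
            // Your policy here: drop the message, log it, send it to a
            // dead-letter topic, or throttle the producing thread.
        }
    }
}
```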

Furthermore, even if no topic/broker has issues, this separation would also improve overall performance, as the original thread won't have to wait on each iteration for the filtering work that would otherwise happen in its own thread.

The first thread just sends the POJOs; the second thread just filters and sends the short POJOs. Simple responsibilities, all in parallel.

Assuming you control the producer and the content it sends, I recommend placing the logic directly there, in order to avoid other intermediate systems (streams, ...). Just extract the fields in your core code and produce the reduced POJO to another topic, using the same producer, with one thread or as many as you wish.

I'd bet my own house and my right hand that this is much, much faster than any streams utility you could think of.

If you don't have access to that code, you could create an intermediate consumer-producer service, summarized in the next section.


  • If the code for the original POJO generation and production is not accessible

If you only have access to the full POJO topic, and not to the previous step (the code that generates the messages and sends them to the topic), the second option is to create an intermediate Kafka consumer-producer, which consumes the messages from the fullPojoTopic, extracts the fields, and produces the filtered shortPojo to the shortPojoTopic.
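That consume-transform-produce loop could look roughly like this. It is a sketch only: it assumes the kafka-clients library, string-serialized messages, and a broker at localhost:9092 (an assumption; substitute your own bootstrap servers), so it cannot run without a cluster:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FilteringBridge {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // assumption
        consumerProps.put("group.id", "pojo-filterer");           // new consumer group
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumption
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("fullPojoTopic"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String shortPojo = getDesiredFields(record.value());
                    producer.send(new ProducerRecord<>("shortPojoTopic", shortPojo));
                }
            }
        }
    }

    // Placeholder: your field-extraction logic goes here.
    static String getDesiredFields(String fullPojo) {
        return fullPojo;
    }
}
```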

Note that the logic is the same as in the first approach, but this solution implies a much greater waste of resources: new producer and consumer threads (trust me, they create a lot of secondary threads), a new consumer group to manage, double transport of the fullPojo messages on the wire, etc.

My opinion is that this option should only be used if you don't have direct access to the code that generates and produces the full POJOs in the first approach, or if you want greater control over the microservice that filters the full data and sends the desired fields to another topic.
