
How to modify messages from the Twitter API saved in one Kafka topic and send them to another Kafka topic

I have created a Kafka producer that uses hbc-core to publish messages from the Twitter API to a topic. I want to modify those messages because I need only a few fields: the tweet creation time, the id string, some basic info about the user, and the tweet text. I tried to use Kafka Streams with a POJO model, but I had trouble extracting the text, because the full text can live in differently named fields depending on whether the tweet was retweeted, was longer than 140 characters, and so on. My POJO model:

{
  "type": "object",
  "properties": {
    "created_at": { "type": "string" },
    "id_str": { "type": "string" },
    "user": {
      "type": "object",
      "properties": {
        "location": { "type": "string" },
        "followers_count": { "type": "integer" },
        "friends_count": { "type": "integer" },
        "created_at": { "type": "string" }
      }
    },
    "text": { "type": "string" }
  }
}

Is Kafka Streams the right tool for this, or is there a better way to extract those fields and write them to another topic?

You don't need intermediate clients, extra systems, fancy Kafka Streams utilities, or miracle frameworks. Focus on the plain old way: reuse the producer and the code that already generates and sends the full POJOs.

Producers are thread safe, so it's completely fine for different threads to use the same producer instance to produce to two or more topics.

This is just a simplification, as I can't know the details of your implementation. I assume the message (POJO) is a simple String. With a bit of imagination, treat each letter as a field. From the fullPojo, you'd like to send a message containing just two fields, represented as y and v, to another topic.

  String fullPojo = "xxxxxyxxxxv";
  //some logic to extract the desired fields
  String shortPojo = getDesiredFields(fullPojo);
  /* shortPojo="yv" */
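A minimal sketch of what getDesiredFields could look like for this toy example, assuming every character other than x is a field worth keeping. The class name FieldFilter is hypothetical. In a real tweet pipeline this method would parse the JSON and copy the needed fields instead of filtering characters.

```java
public class FieldFilter {

    // Toy filter: keep every character that is not the placeholder 'x'.
    static String getDesiredFields(String fullPojo) {
        StringBuilder shortPojo = new StringBuilder();
        for (char c : fullPojo.toCharArray()) {
            if (c != 'x') {
                shortPojo.append(c);
            }
        }
        return shortPojo.toString();
    }

    public static void main(String[] args) {
        String fullPojo = "xxxxxyxxxxv";
        System.out.println(getDesiredFields(fullPojo)); // prints "yv"
    }
}
```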

Create a new topic on your Kafka cluster. For this example, it will be called shortPojoTopic.

Then use the same producer that sends the full data to your original topic, and make a second call that writes the message containing only the filtered values to the short topic:

producer.send(new ProducerRecord<String, String>(fullPojoTopic,  fullPojo));
producer.send(new ProducerRecord<String, String>(shortPojoTopic, shortPojo));

This second call could also be made from a secondary thread. If you want multithreading here, define a second thread that does the "filtering" job. Just pass the original producer reference to this second thread and link both threads with a FIFO structure (a queue, a deque, ...) that holds the fullPojos.

  • The original thread sends the fullPojo to the fullPojoTopic topic and pushes the fullPojo into the queue.
  • The secondary "filterer" thread takes the message at the head of the queue/deque, extracts the desired fields to create the shortPojo, and sends it to the shortPojoTopic (using the same producer, without worrying about producer synchronization issues).
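The two bullets above can be sketched roughly as follows. This is only an illustration under assumptions: the send callback stands in for producer.send(new ProducerRecord<>(topic, value)) so the example runs without a broker, and the class name TwoThreadPipeline, the POISON_PILL stop marker, and the queue capacity of 1000 are all hypothetical choices.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiConsumer;

public class TwoThreadPipeline {

    // Hypothetical marker that tells the filterer thread to stop.
    static final String POISON_PILL = "__STOP__";

    // Toy filter from the example: drop every placeholder 'x'.
    static String getDesiredFields(String fullPojo) {
        return fullPojo.replace("x", "");
    }

    // 'send' must be thread safe, as a shared Kafka producer would be.
    static void run(List<String> fullPojos, BiConsumer<String, String> send) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);

        // Secondary thread: take from the queue, filter, send the short message.
        Thread filterer = new Thread(() -> {
            try {
                while (true) {
                    String fullPojo = queue.take();
                    if (fullPojo.equals(POISON_PILL)) {
                        return;
                    }
                    send.accept("shortPojoTopic", getDesiredFields(fullPojo));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "filterer");
        filterer.start();

        // Original thread: send the full message, then hand it to the filterer.
        try {
            for (String fullPojo : fullPojos) {
                send.accept("fullPojoTopic", fullPojo);
                // put() blocks while the queue is full, which bounds memory
                // at the cost of stalling this thread.
                queue.put(fullPojo);
            }
            queue.put(POISON_PILL);
            filterer.join();
        } catch (InterruptedException e) {
            filterer.interrupt();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<String> sent = new CopyOnWriteArrayList<>();
        run(List.of("xxxxxyxxxxv", "xaxxb"), (topic, value) -> sent.add(topic + ":" + value));
        System.out.println(sent); // relative order across the two topics may vary
    }
}
```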

The second thread also keeps the whole system from locking up if one of the topics is in a bad state and can't accept more messages, if one of the topics lives on a different Kafka cluster that just failed (in this case you will need two different producers as well), or if the filtering step struggles with certain malformed messages. For example, even if the shortPojoTopic is down, the first thread's performance won't be affected: it will keep sending its fullPojos without issues or delays.

Always beware of memory use: the queue/deque size should be limited or controlled in some way to avoid an OOM if the second thread is stuck for a long time or can't keep up with the rhythm of the first thread. If that happens, it won't read and remove messages fast enough, and the resulting backlog could lead to the OOM problem mentioned above.
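One way to keep the queue bounded is a fixed-capacity queue with a timed offer, sketched below. The class name BoundedHandoff, the handOff helper, the capacity of 2, and the 10 ms timeout are assumptions for illustration. What to do with a rejected message (drop it, log it, retry later) is a design decision this sketch leaves open.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedHandoff {

    // Tries to hand a message to the filterer thread without ever growing the
    // queue past its capacity. Returns false if the queue stayed full for the
    // whole timeout (or the calling thread was interrupted while waiting).
    static boolean handOff(BlockingQueue<String> queue, String fullPojo, long timeoutMillis) {
        try {
            return queue.offer(fullPojo, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Capacity 2 and no consumer, to show what happens when the filterer is stuck.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        System.out.println(handOff(queue, "a", 10)); // true
        System.out.println(handOff(queue, "b", 10)); // true
        System.out.println(handOff(queue, "c", 10)); // false: the queue is full
    }
}
```

With this approach the original thread waits at most the timeout per message, so a stuck filterer costs a bounded delay and a bounded amount of memory.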

Furthermore, even if no topic or broker has issues, this separation should also improve overall performance, because the original thread no longer has to wait on each iteration for filtering work that would otherwise run in that same thread.

The first thread just sends the POJOs; the second thread just filters and sends the short POJOs. Simple responsibilities, all in parallel.

Assuming you control the producer and the content it sends, I recommend placing the logic directly there, in order to avoid other intermediate systems (streams, ...). Just extract the fields in your core code and produce the reduced POJO to another topic using the same producer, with one thread or as many as you like.

I'd bet my own house and my right hand that this is much, much faster than any streams utility you could think of.

If you don't have access to that code, you could create an intermediate consumer-producer service, summarized in the next section.


  • If the code that generates and produces the original POJO is not accessible

If you only have access to the full POJO topic, and not to the previous step (the code that generates the messages and sends them to the topic), the second option is to create an intermediate Kafka consumer-producer that consumes the messages from the fullPojoTopic, extracts the fields, and produces the filtered shortPojo to the shortPojoTopic.

Note that the logic is the same as in the first approach, but this solution uses considerably more resources: new producer and consumer threads (trust me, they create a lot of secondary threads), a new consumer group to manage, a second trip over the wire for every fullPOJO message, etc.

My opinion is that this option should only be used if you don't have direct access to the code that generates and produces the full POJOs in the first place, or if you want greater control over the microservice that filters the full data and sends the desired fields to another topic.

