
Getting only null values in the right stream for a left join in Kafka Streams

I'm trying to join two streams of data from two Kafka topics.

Each topic has key-value pairs, where the key is an Integer and the value is JSON in string format. The data from the two sources looks like the following (key, value) examples:

2232, {"uniqueID":"2164103","ConsumerID":"63357","CategoryID":"8","BrandID":"5","ProductID":"2232","ProductDetails":"[]","Date":"2013-03-28","Flag":"0"}

1795, {"ProductName":"Frost Free","ProductID":"1795","BrandID":"16","BrandName":"ABC","CategoryID":"3"}

Now I'm trying to left-join these two streams on ProductID, so the key is set to ProductID for all of these records. Unfortunately, I keep getting null for the right-stream value of the join; not even a single record joins properly. The following is my code to join the two streams:

import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

import java.util.concurrent.TimeUnit;
import java.util.*;

public class Tester {
    public static void main(String[] args){
        final Properties streamsConfiguration = new Properties();

        final Serde<String> stringSerde = Serdes.String();
        final Serde<Integer> intSerde = Serdes.Integer();

        streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-streams");
        streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "joining-Client");
        streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, intSerde.getClass().getName());
        streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, stringSerde.getClass().getName());

        streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);
        streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
        streamsConfiguration.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 9000);

        final KStreamBuilder builder = new KStreamBuilder();
        KStream<Integer,String> pbData = builder.stream(intSerde,stringSerde,"Ptopic");
        KStream<Integer,String> streamData = builder.stream(intSerde,stringSerde,"Dtopic");
// Test the data type and value of the key
        pbData.selectKey((k,v)->{System.out.println("Table : P, Type : "+k.getClass()+" Value : "+k);return k;});
        streamData.selectKey((k,v)->{System.out.println("Table : StreamRecord, Type : "+k.getClass()+" Value : "+k);return k;});

        KStream<Integer,String> joined = streamData.leftJoin(pbData,(table1Value,table2Value)->returnJoin(table1Value,table2Value),JoinWindows.of(TimeUnit.SECONDS.toMillis(30)));

        final KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);
        streams.cleanUp();
        streams.start();

        // Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
    private static HashMap convertToHashMap(String jsonString, String tablename){
        try{
            HashMap<String,String> map = new Gson().fromJson(jsonString, new TypeToken<HashMap<String, String>>(){}.getType());
            return map;
        }
        catch(Exception x){
            //couldn't properly parse json
            HashMap<String,String> record = new HashMap<>();
            if (tablename.equals("PB")){
                List<String> keys = new ArrayList<>(Arrays.asList("ProductName", "CategoryID", "ProductID", "BrandID", "BrandName", "ProductCategoryID"));
                for(String key : keys){
                    record.put(key,null);
                }
            }
            else{
                List<String> keys = new ArrayList<>(Arrays.asList("UniqueID", "ConsumerID", "CategoryID", "BrandID", "ProductID", "Date","Flag","ProductDetails"));
                for(String key : keys){
                    record.put(key,null);
                }
            }
            return record;
        }
    }
    private static String returnJoin(String map1, String map2){
        HashMap h1 = convertToHashMap(map1,"consumer_product");
        HashMap h2 = convertToHashMap(map2,"PB");
        HashMap map3 = new HashMap<>();

        System.out.println("First : " + map1);
        System.out.println("Second : " + map2);
        //else{System.out.println("Null only");}
        for (Object key : h1.keySet()) {
            key = key.toString();
            if (map3.containsKey(key)) {
                continue;
            }
            map3.put(key, h1.get(key));
        }
        try {
            for (Object key : h2.keySet()) {
                key = key.toString();
                if (map3.containsKey(key)) {
                    continue;
                }
                map3.put(key, h2.get(key));
            }
            System.out.println("Worked Okay PB!!!\n--------------------------------------------------------------------------------------");
        }
        catch (NullPointerException ex){
            /*System.out.println("Exception\n----------------------------------------------------------------------------");
            HashMap fakeC = getHashMap("{","consumer");
            for (Object key : fakeC.keySet()) {
                key = key.toString();
                if (map3.containsKey(key)) {
                    continue;
                }
                map3.put(key, fakeC.get(key));
            }*/
            return "INVALID";
        }
        //return map3;
        return serializeObjectJSON(map3);
    }
    private static String serializeObjectJSON(Map row){
        StringBuilder jsonString = new StringBuilder();
        jsonString.append("{");
        for ( Object key : row.keySet()){
            jsonString.append("\""+key.toString()+"\":");
            try {
                jsonString.append("\"" + row.get(key).toString() + "\",");
            }
            catch (NullPointerException Nexp){
                jsonString.append("\"" + "null" + "\",");
            }

        }
        jsonString.deleteCharAt(jsonString.length()-1);
        jsonString.append("}");
        String jsString = jsonString.toString();
        ////System.out.println("JString :"+jsString);
        return jsString;
    }
}

I can't figure out why I'm only getting null in the right stream of the left join when I try to join the two streams either way, yet when I join the same stream with itself, the join works.

I've made sure that the key type is Integer for all the records in both streams and that no nulls are present, as I'm checking the types and key values of both streams (see the code above). I've also made sure that both streams have overlapping keys so the join can happen, since I thought either the keys wouldn't overlap or the data types might differ, because those are the usual causes of null values in joins.

Can anyone help me figure out what I'm doing wrong?

Update :

The data in these two topics (the ones I'm joining) comes from two streams: one is a stream of custom (key, value) pairs of type (Integer, recordHashmap) and the other is a stream of (Integer, String). Here recordHashmap is a custom object I've defined to parse a nested JSON string into an object. Its definition is below:

public class recordHashmap {
    private String database;
    private String table;
    private String type;
    private Integer ts;
    private Integer xid;
    private Map<String,String> data;

    public Map getdata(){
        return data;
    }
    public String getdatabase(){return database;}
    public String gettable(){return table;}
    public String gettype(){return type;}
    public Integer getts(){return ts;}
    public Integer getxid(){return xid;}

    public void setdata(Map<String, String> dta){
        data=dta;
    }
    public void setdatabase(String db){ database=db; }
    public void settable(String tble){ table=tble; }
    public void settype(String optype){type=optype;}
    public void setts(Integer unixTime){ts = unixTime;}
    public void setxid(Integer Xid){xid = Xid;}

    public String toString() {
        return "Database=" + this.database + ", Table=" + this.table + ", OperationType=" + this.type
                + ", UnixOpTime=" + this.ts + ", Data=" + this.data;
    }

}

And the code that sets the key to ProductID can be seen below:

KStream<Integer,recordHashmap> rekeyedProductID = inserts.selectKey((k,v)->setTheKey(v.getdata(),"ProductID"));
KStream<Integer,String> consumer_product_Stream = rekeyedProductID.mapValues((v)->serializeObjectJSON(v.getdata()));

And the function setTheKey is defined as

private static Integer setTheKey(Map map, String Key){
        try {
            //System.out.println("New Key : " + map.get(Key));
            return Integer.parseInt(map.get(Key).toString());
        }
        catch (NumberFormatException | NullPointerException e){
            // key missing or not a number: return a sentinel value
            return -1;
        }
    }

Example console logs for the following two statements are shown below (note: the full logs are too large to include, but the main point is that both streams have Integer keys and the keys overlap):

pbData.selectKey((k,v)->{System.out.println("Table : P, Type : "+k.getClass()+" Value : "+k);return k;});
streamData.selectKey((k,v)->{System.out.println("Table : StreamRecord, Type : "+k.getClass()+" Value : "+k);return k;});

Console Logs :

Table : streamRecord, Type:class java.lang.Integer Value:1342
Table : streamRecord, Type:class java.lang.Integer Value:595
Table : streamRecord, Type:class java.lang.Integer Value:1934
Table : streamRecord, Type:class java.lang.Integer Value:2384
Table : streamRecord, Type:class java.lang.Integer Value:1666
Table : streamRecord, Type:class java.lang.Integer Value:665
Table : streamRecord, Type:class java.lang.Integer Value:2671
Table : streamRecord, Type:class java.lang.Integer Value:949
Table : streamRecord, Type:class java.lang.Integer Value:2455
Table : streamRecord, Type:class java.lang.Integer Value:928
Table : streamRecord, Type:class java.lang.Integer Value:1602
Table : streamRecord, Type:class java.lang.Integer Value:74
Table : P, Type:class java.lang.Integer Value:2
Table : streamRecord, Type:class java.lang.Integer Value:1795
Table : P, Type:class java.lang.Integer Value:21
Table : streamRecord, Type:class java.lang.Integer Value:1265
Table : P, Type:class java.lang.Integer Value:22
Table : streamRecord, Type:class java.lang.Integer Value:2420
Table : P, Type:class java.lang.Integer Value:23
Table : streamRecord, Type:class java.lang.Integer Value:1419
Table : P, Type:class java.lang.Integer Value:24
Table : streamRecord, Type:class java.lang.Integer Value:1395
Table : P, Type:class java.lang.Integer Value:26
Table : streamRecord, Type:class java.lang.Integer Value:1783
Table : P, Type:class java.lang.Integer Value:29
Table : streamRecord, Type:class java.lang.Integer Value:1177
Table : P, Type:class java.lang.Integer Value:34
Table : streamRecord, Type:class java.lang.Integer Value:1395
Table : P, Type:class java.lang.Integer Value:35
Table : streamRecord, Type:class java.lang.Integer Value:2551
Table : P, Type:class java.lang.Integer Value:36
Table : P, Type:class java.lang.Integer Value:2551
Table : streamRecord, Type:class java.lang.Integer Value:2530
Table : P, Type:class java.lang.Integer Value:37
Table : streamRecord, Type:class java.lang.Integer Value:541
Table : P, Type:class java.lang.Integer Value:39
Table : streamRecord, Type:class java.lang.Integer Value:787
Table : P, Type:class java.lang.Integer Value:40
Table : streamRecord, Type:class java.lang.Integer Value:2498
Table : P, Type:class java.lang.Integer Value:41
Table : streamRecord, Type:class java.lang.Integer Value:1439
Table : P, Type:class java.lang.Integer Value:44
Table : streamRecord, Type:class java.lang.Integer Value:784
Table : P, Type:class java.lang.Integer Value:284
Table : P, Type:class java.lang.Integer Value:285
Table : P, Type:class java.lang.Integer Value:929
Table : P, Type:class java.lang.Integer Value:286
Table : P, Type:class java.lang.Integer Value:287
Table : P, Type:class java.lang.Integer Value:2225
Table : P, Type:class java.lang.Integer Value:288
Table : P, Type:class java.lang.Integer Value:289
Table : P, Type:class java.lang.Integer Value:290
Table : P, Type:class java.lang.Integer Value:295
Table : P, Type:class java.lang.Integer Value:297
Table : P, Type:class java.lang.Integer Value:300
Table : P, Type:class java.lang.Integer Value:302
Table : P, Type:class java.lang.Integer Value:305
Table : P, Type:class java.lang.Integer Value:306
Table : P, Type:class java.lang.Integer Value:307
Table : P, Type:class java.lang.Integer Value:308
Table : P, Type:class java.lang.Integer Value:309
Table : P, Type:class java.lang.Integer Value:310
Table : streamRecord, Type:class java.lang.Integer Value:929
Table : streamRecord, Type:class java.lang.Integer Value:1509
Table : streamRecord, Type:class java.lang.Integer Value:136
Table : streamRecord, Type:class java.lang.Integer Value:2225
Table : streamRecord, Type:class java.lang.Integer Value:906
Table : streamRecord, Type:class java.lang.Integer Value:1013
Table : streamRecord, Type:class java.lang.Integer Value:1759
Table : streamRecord, Type:class java.lang.Integer Value:1759
Table : streamRecord, Type:class java.lang.Integer Value:885
Table : streamRecord, Type:class java.lang.Integer Value:1165
Table : streamRecord, Type:class java.lang.Integer Value:453

Update-2 : The interesting thing here is that leftJoin works fine for KTables with the same set of key-value pairs, but not for KStreams for some reason. I need to use KStreams, though, since I have many records per key. Most of the time this join on streams works like a charm, but it's acting strangely in this particular case. I'm guessing this could have to do with RocksDB or internal caching.

It seems that you don't set the ProductID as key:

pbData.selectKey((k,v)->{System.out.println("Table : P, Type : "+k.getClass()+" Value : "+k);return k;});
streamData.selectKey((k,v)->{System.out.println("Table : StreamRecord, Type : "+k.getClass()+" Value : "+k);return k;});

In both statements, you return the original key (return k;) instead of parsing the ProductID from the JSON value and returning it.
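A minimal sketch of what that re-keying could look like, assuming the value is the JSON string shown in the question. The regex-based extractProductId helper is hypothetical and only for illustration; a real implementation would use a proper JSON parser such as Gson:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RekeySketch {
    // Hypothetical helper: pull the ProductID field out of the JSON value string.
    private static final Pattern PRODUCT_ID =
            Pattern.compile("\"ProductID\"\\s*:\\s*\"?(\\d+)\"?");

    static Integer extractProductId(String json) {
        Matcher m = PRODUCT_ID.matcher(json);
        return m.find() ? Integer.parseInt(m.group(1)) : -1; // -1 = sentinel for "not found"
    }

    public static void main(String[] args) {
        String value = "{\"ProductName\":\"Frost Free\",\"ProductID\":\"1795\",\"BrandID\":\"16\"}";
        System.out.println(extractProductId(value)); // prints 1795
    }
}
```

With such a helper, the key would actually be replaced and the re-keyed stream kept, e.g. pbData = pbData.selectKey((k, v) -> extractProductId(v)); note that selectKey returns a new stream, so its result must be assigned rather than discarded.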

Update

I am still not sure I can put all the pieces together correctly, since in your update you use

KStream<Integer,recordHashmap> rekeyedProductID = inserts.selectKey((k,v)->setTheKey(v.getdata(),"ProductID"));
KStream<Integer,String> consumer_product_Stream = rekeyedProductID.mapValues((v)->serializeObjectJSON(v.getdata()));

and it's unclear what inserts and rekeyedProductID are (what are their types?). Anyway, I assume this part is correct. Since you mention that it works if the right-hand side is a KTable (using the same data), I assume your join window is not large enough, so that two records with the same key are farther apart in time than the 30 seconds you specified. Can you double-check the record timestamps of both input streams? (cf. https://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information )
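To make the window hypothesis concrete, here is a small self-contained sketch (plain Java, no Kafka dependency) of the matching rule a windowed KStream-KStream join applies: two records join only if their keys are equal and their timestamps differ by at most the window size. When the right-side record falls outside the window, a leftJoin still emits the left value, but paired with null, which matches the symptom described:

```java
public class JoinWindowSketch {
    // Windowed-join matching rule (sketch): same key AND timestamps within the window.
    static boolean wouldJoin(int leftKey, long leftTs, int rightKey, long rightTs, long windowMs) {
        return leftKey == rightKey && Math.abs(leftTs - rightTs) <= windowMs;
    }

    public static void main(String[] args) {
        long window = 30_000L; // JoinWindows.of(TimeUnit.SECONDS.toMillis(30))
        // Same key, 10 seconds apart: the records join.
        System.out.println(wouldJoin(1795, 0L, 1795, 10_000L, window));
        // Same key, 5 minutes apart: leftJoin emits (leftValue, null) instead.
        System.out.println(wouldJoin(1795, 0L, 1795, 300_000L, window));
    }
}
```

This also explains why the KTable variant works: a table join compares the latest value per key regardless of how far apart the records' timestamps are, so no window applies.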
