
Hyperledger Fabric orderers will not connect to Kafka brokers from a different machine

I have a Hyperledger Fabric network set up on AWS with different components on different EC2 machines. I can physically move things around wherever I want, except for the brokers and orderers. They seem to need to sit on the same machine.

If I spin up my ZooKeepers, Kafka brokers and orderers with one compose file on one EC2 machine, everything works fine. I then tried spinning up the ZooKeepers and Kafka brokers on one EC2 machine, gave them time to settle, and then spun up my orderers on another EC2 machine.

When the orderers spin up, they seem to connect to the brokers at first, and the broker logs show output for creating a new topic, partition, etc. for testchainid (the dummy channel every orderer creates right at the start):

[2018-05-09 01:48:10,978] INFO Topic creation {"version":1,"partitions":{"0":[2,3,0]}} (kafka.admin.AdminUtils$)
[2018-05-09 01:48:10,988] INFO [KafkaApi-0] Auto creation of topic testchainid with 1 partitions and replication factor 3 is successful (kafka.server.KafkaApis)
[2018-05-09 01:48:10,999] INFO Topic creation {"version":1,"partitions":{"0":[2,3,0]}} (kafka.admin.AdminUtils$)
[2018-05-09 01:48:11,059] INFO Replica loaded for partition testchainid-0 with initial high watermark 0 (kafka.cluster.Replica)
[2018-05-09 01:48:11,062] INFO Replica loaded for partition testchainid-0 with initial high watermark 0 (kafka.cluster.Replica)
[2018-05-09 01:48:11,152] INFO Loading producer state from offset 0 for partition testchainid-0 with message format version 2 (kafka.log.Log)
[2018-05-09 01:48:11,159] INFO Completed load of log testchainid-0 with 1 log segments, log start offset 0 and log end offset 0 in 49 ms (kafka.log.Log)
[2018-05-09 01:48:11,164] INFO Created log for partition [testchainid,0] in /tmp/kafka-logs with properties {compression.type -> producer, message.format.version -> 1.0-IV0, file.delete.delay.ms -> 60000, max.message.bytes -> 103809024, min.compaction.lag.ms -> 0, message.timestamp.type -> CreateTime, min.insync.replicas -> 2, segment.jitter.ms -> 0, preallocate -> false, min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096, unclean.leader.election.enable -> false, retention.bytes -> -1, delete.retention.ms -> 86400000, cleanup.policy -> [delete], flush.ms -> 9223372036854775807, segment.ms -> 604800000, segment.bytes -> 1073741824, retention.ms -> -1, message.timestamp.difference.max.ms -> 9223372036854775807, segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}. (kafka.log.LogManager)
[2018-05-09 01:48:11,165] INFO [Partition testchainid-0 broker=0] No checkpointed highwatermark is found for partition testchainid-0 (kafka.cluster.Partition)
[2018-05-09 01:48:11,165] INFO Replica loaded for partition testchainid-0 with initial high watermark 0 (kafka.cluster.Replica)
[2018-05-09 01:48:11,167] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions testchainid-0 (kafka.server.ReplicaFetcherManager)
[2018-05-09 01:48:11,241] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Starting (kafka.server.ReplicaFetcherThread)
[2018-05-09 01:48:11,242] INFO [ReplicaFetcherManager on broker 0] Added fetcher for partitions List([testchainid-0, initOffset 0 to broker BrokerEndPoint(2,0fb5eecc47b8,9092)] ) (kafka.server.ReplicaFetcherManager)
[2018-05-09 01:48:11,306] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Based on follower's leader epoch, leader replied with an offset 0 >= the follower's log end offset 0 in testchainid-0. No truncation needed. (kafka.server.ReplicaFetcherThread)
[2018-05-09 01:48:11,311] INFO Truncating testchainid-0 to 0 has no effect as the largest offset in the log is -1. (kafka.log.Log)

but then they can't seem to maintain a connection.

This is the log dump of when my orderer tries to connect to a broker:

2018-05-09 01:49:12.483 UTC [orderer/consensus/kafka] try -> DEBU 0f9 [channel: testchainid] Attempting to post the CONNECT message...
2018-05-09 01:49:12.483 UTC [orderer/consensus/kafka/sarama] newHighWatermark -> DEBU 0fa producer/leader/testchainid/0 state change to [retrying-1]
2018-05-09 01:49:12.483 UTC [orderer/consensus/kafka/sarama] newHighWatermark -> DEBU 0fb producer/leader/testchainid/0 abandoning broker 2
2018-05-09 01:49:12.483 UTC [orderer/consensus/kafka/sarama] shutdown -> DEBU 0fc producer/broker/2 shut down
2018-05-09 01:49:12.583 UTC [orderer/consensus/kafka/sarama] tryRefreshMetadata -> DEBU 0fd client/metadata fetching metadata for [testchainid] from broker kafka0.hyperfabric.xyz:9092
2018-05-09 01:49:12.778 UTC [orderer/consensus/kafka/sarama] Validate -> DEBU 0fe ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2018-05-09 01:49:12.779 UTC [orderer/consensus/kafka/sarama] run -> DEBU 0ff producer/broker/2 starting up
2018-05-09 01:49:12.779 UTC [orderer/consensus/kafka/sarama] run -> DEBU 100 producer/broker/2 state change to [open] on testchainid/0
2018-05-09 01:49:12.779 UTC [orderer/consensus/kafka/sarama] dispatch -> DEBU 101 producer/leader/testchainid/0 selected broker 2
2018-05-09 01:49:12.779 UTC [orderer/consensus/kafka/sarama] flushRetryBuffers -> DEBU 102 producer/leader/testchainid/0 state change to [flushing-1]
2018-05-09 01:49:12.779 UTC [orderer/consensus/kafka/sarama] flushRetryBuffers -> DEBU 103 producer/leader/testchainid/0 state change to [normal]
2018-05-09 01:49:12.790 UTC [orderer/consensus/kafka/sarama] func1 -> DEBU 104 Failed to connect to broker 0fb5eecc47b8:9092: dial tcp: lookup 0fb5eecc47b8 on 127.0.0.11:53: no such host
2018-05-09 01:49:12.790 UTC [orderer/consensus/kafka/sarama] handleError -> DEBU 105 producer/broker/2 state change to [closing] because dial tcp: lookup 0fb5eecc47b8 on 127.0.0.11:53: no such host
2018-05-09 01:49:12.790 UTC [orderer/consensus/kafka/sarama] newHighWatermark -> DEBU 106 producer/leader/testchainid/0 state change to [retrying-2]
2018-05-09 01:49:12.790 UTC [orderer/consensus/kafka/sarama] newHighWatermark -> DEBU 107 producer/leader/testchainid/0 abandoning broker 2
2018-05-09 01:49:12.790 UTC [orderer/consensus/kafka/sarama] shutdown -> DEBU 108 producer/broker/2 shut down
2018-05-09 01:49:12.890 UTC [orderer/consensus/kafka/sarama] tryRefreshMetadata -> DEBU 109 client/metadata fetching metadata for [testchainid] from broker kafka0.hyperfabric.xyz:9092
2018-05-09 01:49:13.086 UTC [orderer/consensus/kafka/sarama] Validate -> DEBU 10a ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2018-05-09 01:49:13.086 UTC [orderer/consensus/kafka/sarama] run -> DEBU 10b producer/broker/2 starting up
2018-05-09 01:49:13.086 UTC [orderer/consensus/kafka/sarama] run -> DEBU 10c producer/broker/2 state change to [open] on testchainid/0
2018-05-09 01:49:13.086 UTC [orderer/consensus/kafka/sarama] dispatch -> DEBU 10d producer/leader/testchainid/0 selected broker 2
2018-05-09 01:49:13.086 UTC [orderer/consensus/kafka/sarama] flushRetryBuffers -> DEBU 10e producer/leader/testchainid/0 state change to [flushing-2]
2018-05-09 01:49:13.086 UTC [orderer/consensus/kafka/sarama] flushRetryBuffers -> DEBU 10f producer/leader/testchainid/0 state change to [normal]
2018-05-09 01:49:13.091 UTC [orderer/consensus/kafka/sarama] func1 -> DEBU 110 Failed to connect to broker 0fb5eecc47b8:9092: dial tcp: lookup 0fb5eecc47b8 on 127.0.0.11:53: no such host
2018-05-09 01:49:13.092 UTC [orderer/consensus/kafka/sarama] handleError -> DEBU 111 producer/broker/2 state change to [closing] because dial tcp: lookup 0fb5eecc47b8 on 127.0.0.11:53: no such host
2018-05-09 01:49:13.092 UTC [orderer/consensus/kafka/sarama] newHighWatermark -> DEBU 112 producer/leader/testchainid/0 state change to [retrying-3]
2018-05-09 01:49:13.092 UTC [orderer/consensus/kafka/sarama] newHighWatermark -> DEBU 113 producer/leader/testchainid/0 abandoning broker 2
2018-05-09 01:49:13.092 UTC [orderer/consensus/kafka/sarama] shutdown -> DEBU 114 producer/broker/2 shut down
2018-05-09 01:49:13.192 UTC [orderer/consensus/kafka/sarama] tryRefreshMetadata -> DEBU 115 client/metadata fetching metadata for [testchainid] from broker kafka0.hyperfabric.xyz:9092
2018-05-09 01:49:13.387 UTC [orderer/consensus/kafka/sarama] Validate -> DEBU 116 ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2018-05-09 01:49:13.388 UTC [orderer/consensus/kafka/sarama] run -> DEBU 117 producer/broker/2 starting up
2018-05-09 01:49:13.388 UTC [orderer/consensus/kafka/sarama] run -> DEBU 118 producer/broker/2 state change to [open] on testchainid/0
2018-05-09 01:49:13.388 UTC [orderer/consensus/kafka/sarama] dispatch -> DEBU 119 producer/leader/testchainid/0 selected broker 2
2018-05-09 01:49:13.388 UTC [orderer/consensus/kafka/sarama] flushRetryBuffers -> DEBU 11a producer/leader/testchainid/0 state change to [flushing-3]
2018-05-09 01:49:13.388 UTC [orderer/consensus/kafka/sarama] flushRetryBuffers -> DEBU 11b producer/leader/testchainid/0 state change to [normal]
2018-05-09 01:49:13.427 UTC [orderer/consensus/kafka/sarama] func1 -> DEBU 11c Failed to connect to broker 0fb5eecc47b8:9092: dial tcp: lookup 0fb5eecc47b8 on 127.0.0.11:53: no such host
2018-05-09 01:49:13.427 UTC [orderer/consensus/kafka/sarama] handleError -> DEBU 11d producer/broker/2 state change to [closing] because dial tcp: lookup 0fb5eecc47b8 on 127.0.0.11:53: no such host
2018-05-09 01:49:13.427 UTC [orderer/consensus/kafka] try -> DEBU 11e [channel: testchainid] Need to retry because process failed = dial tcp: lookup 0fb5eecc47b8 on 127.0.0.11:53: no such host

I also noticed that the log mentions the IP and port 127.0.0.11:53. A quick search suggests this is related to Docker Swarm. Do I have to use Docker Swarm to make this work? I'd prefer not to, since so far everything else hasn't needed it and has just connected through IP addresses.

I don't know what the exact problem is or where to start looking for answers. Any assistance would be appreciated.

-- Update --

I did some more experimenting and was able to move the ZooKeepers to a different machine as well. It is now only the Kafka brokers and orderers that will not communicate with each other unless they are on the same machine.

I have a feeling I haven't specified the Kafka broker addresses somewhere, and that is why the log shows 127.0.0.11:53 instead of their actual addresses. Currently I have only specified the addresses in the configtx file, which is used for the genesis block. Do they need to be set somewhere else as well? How does the orderer know where the Kafka brokers are?

My configtx:

Kafka:
    # Brokers: A list of Kafka brokers to which the orderer connects
    # NOTE: Use IP:port notation
    Brokers:
        - kafka0.hyperfabric.xyz:9092
        - kafka1.hyperfabric.xyz:10092
        - kafka2.hyperfabric.xyz:11092
        - kafka3.hyperfabric.xyz:12092

The way Kafka works, the client connects to a broker from an initial list (in the Fabric case, provided through the channel config). That broker then replies with a list of information about the other brokers that exist, which partitions they lead, and so on. The client then uses this second list to pick the correct broker to connect to.

What you are seeing is that the "second list" reports unresolvable hostnames, because the brokers are running in containers on another machine. To fix this, you can have each Kafka broker advertise a different hostname than the one the container resolves locally. This is the "advertised.host.name" setting from the Kafka documentation; note that there is also an advertised port you may wish to override. For the Fabric-provided images, you can set the KAFKA_ADVERTISED_HOST_NAME and KAFKA_ADVERTISED_PORT environment variables to apply this override.
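As a rough sketch (not your exact compose file), the broker service definition might look something like the following. The image name, broker ID, ZooKeeper address and port mapping here are placeholders; the advertised host and port match the first entry in your configtx, and the replication settings match what is visible in your broker log:

kafka0:
  image: hyperledger/fabric-kafka            # placeholder; use whatever Kafka image you run today
  environment:
    - KAFKA_BROKER_ID=0
    # Advertise an externally resolvable address instead of the container
    # hostname (e.g. 0fb5eecc47b8), so the orderer can dial it from another machine.
    - KAFKA_ADVERTISED_HOST_NAME=kafka0.hyperfabric.xyz
    - KAFKA_ADVERTISED_PORT=9092
    - KAFKA_ZOOKEEPER_CONNECT=zookeeper0.hyperfabric.xyz:2181   # placeholder ZooKeeper address
    - KAFKA_MIN_INSYNC_REPLICAS=2
    - KAFKA_DEFAULT_REPLICATION_FACTOR=3
  ports:
    - "9092:9092"

Each broker needs its own advertised host/port pair matching the corresponding entry in the configtx (so kafka1 would advertise port 10092, and so on).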

Set a name that is resolvable externally, and your problem should be solved.
