Hadoop timing out trying to write to Cassandra in AWS multi-region configuration
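I'm running a multi-DC Cassandra cluster (open source, not DSE) in AWS, with one DC (us-west-2) set up for analytics and the other (us-east) for transactional storage. I'm using NetworkTopologyStrategy with the EC2 snitch, and a consistency level of LOCAL_ONE in the Hadoop configuration. Hadoop can read from Cassandra without any problems, but trying to write produces timeout exceptions.
Running nodetool status shows that the DCs are configured correctly: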
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN x.x.x.x 1.01 GB 9.9% 9e7f4393-7ac9-4559-b3ff-de48be50016f -9127921345534057723 2a
UN x.x.x.x 1001.16 MB 11.4% d0760383-c3dd-474c-9261-239b71dba3f1 -9221279003374097975 2b
UN x.x.x.x 1.05 GB 11.7% 3f09fbf5-0d85-4283-9009-0ec0e29223c0 -9140104347498952504 2c
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN x.x.x.x 1.1 GB 11.3% 5bbd2de4-e1d2-4a17-9f40-034f60b35954 -9061054426204373981 1b
UN x.x.x.x 1.15 GB 11.5% e34c590e-6176-45b2-a8f9-18b4a9a80032 -9216519687724118609 1c
UN x.x.x.x 1.18 GB 10.9% fa0b0a1a-f156-40fc-a267-970d1eb9cddb -9207673937991303291 1a
UN x.x.x.x 1.46 GB 10.7% b18ae406-c9ec-42b7-a365-b0c6e2fe582f -9206671929961171506 1a
UN x.x.x.x 1.13 GB 11.4% 1ac9c1c5-55ad-4048-b1ba-3b9768933ecc -9146100851344467112 1c
UN x.x.x.x 1.53 GB 11.2% dad665bb-68d9-4811-b421-f33333261867 -9178920986366339267 1b
Stack trace using ColumnFamilyOutputFormat:
java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:224)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:215)
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
... 4 more
...and using CqlOutputFormat:
java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
... 4 more
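Both traces ultimately lead to AbstractColumnFamilyOutputFormat.createAuthenticatedClient(host, port, conf).
I then opened that source and added some detail to the exception so that it would print the name of the host being connected to, which produced the following trace: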
java.io.IOException: java.lang.Exception: Unable to connect to host [hostname]
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
Caused by: java.lang.Exception: Unable to connect to host [hostname]
at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:139)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:124)
... 1 more
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
... 4 more
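The problem is that [hostname] is a machine that is not in the analytics cluster (it is in us-east). Why doesn't the client know this automatically, especially when reads work fine? It seems to be trying every node in the ring, regardless of DC.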
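One rough way to check which ring members a Hadoop task node can actually reach is a plain TCP probe against each node's Thrift port. The sketch below is only an illustration: PortProbe and canConnect are hypothetical names, not part of Cassandra, and the port 9160 (Cassandra's default rpc_port) is an assumption about this cluster.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Cassandra or Hadoop.
final class PortProbe {

    /** Returns true if a TCP connection to host:port opens within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, connect timeouts, and unresolvable hosts.
            System.err.println("Unable to connect to host " + host + ":" + port + " (" + e + ")");
            return false;
        }
    }

    public static void main(String[] args) {
        int thriftPort = 9160; // assumption: default rpc_port, adjust if yours differs
        for (String host : args) {
            boolean ok = canConnect(host, thriftPort, 5000);
            System.out.println(host + " -> " + (ok ? "reachable" : "unreachable"));
        }
    }
}
```

Running it from a Hadoop node with the addresses reported by nodetool status as arguments should show whether only the remote-DC nodes fail to connect.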
For the record, writes fail with CqlOutputFormat and ColumnFamilyOutputFormat, as well as through Pig using CqlStorage and CassandraStorage.
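I would say try setting write_request_timeout_in_ms in cassandra.yaml to a very high number and see whether that helps. There may be an issue with the node itself still being unresponsive when the write fails. If it still times out, restart the service on the node that is causing the problem.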
The issue came down to two things:
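1. For multi-region EC2 setups, Cassandra requires broadcast_address to be set to the public IP and listen_address to be set to the internal IP. In most cases you would want rpc_address to be the internal IP, but this can break Cassandra's Hadoop client, which determines which endpoints to talk to based on broadcast_address.
2. Cassandra's Hadoop client (specifically RingCache) does not take the datacenter into account during node discovery and instead tries to discover all nodes in the ring, including non-local ones. It does respect the consistency level for the actual write, but in our case it never got that far because of #1.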
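The second point can be illustrated with a small, self-contained sketch of datacenter-aware endpoint filtering. This is not the RingCache code and not the submitted patch. LocalDcFilter and endpointsInDc are hypothetical names, and the addresses are placeholders from documentation ranges. It only shows the idea of keeping the endpoints whose datacenter matches the local one before opening any connections.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; Cassandra's RingCache is structured differently.
final class LocalDcFilter {

    /** Returns the endpoints whose datacenter equals localDc, in map iteration order. */
    static List<String> endpointsInDc(Map<String, String> endpointToDc, String localDc) {
        List<String> local = new ArrayList<>();
        for (Map.Entry<String, String> entry : endpointToDc.entrySet()) {
            if (localDc.equals(entry.getValue())) {
                local.add(entry.getKey());
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // Placeholder addresses; a real client would get this mapping from the cluster.
        Map<String, String> ring = new LinkedHashMap<>();
        ring.put("192.0.2.10", "us-west-2");
        ring.put("192.0.2.11", "us-west-2");
        ring.put("198.51.100.20", "us-east");

        // Only the analytics DC's nodes remain as connection candidates.
        System.out.println(endpointsInDc(ring, "us-west-2")); // prints [192.0.2.10, 192.0.2.11]
    }
}
```

With a filter like this in front of the connection step, a writer in us-west-2 would never attempt the us-east addresses that are unreachable from the Hadoop nodes.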
I filed tickets and submitted patches to fix these issues: