
Hot spot when using Hive to insert into Cassandra

I am evaluating DSE 3.1.3 Cassandra using the EC2 DataStax AMI.

Test setups

  1. 5 x m1.xlarge in one test: 4 vCPUs, 15 GB RAM, 4 x 420 GB instance store.
  2. 5 x hi1.4xlarge in another: 16 vCPUs, 60 GB RAM, 2 x 1 TB SSD instance store.

Data

  • 5000+ Apache log files, ~60 GB, 60MM rows.

Workflow

  1. Load into CFS via dse hadoop fs -put.
  2. Load into Hive from CFS with RegexSerDe.
  3. Create event table in Cassandra via CQL in keyspace logs.
  4. Insert into Cassandra from Hive via INSERT INTO logs.event.

Overall, the performance of the first two steps, along with basic queries, is on par with other Hadoop stacks. Being able to refer to a Cassandra table directly from Hive, without having to explicitly define an external table, is great.

However, the INSERT operation is taking 3-4 times longer than on other common Hadoop stacks. I must have set something up wrong, and am looking for help or suggestions.

From a rudimentary look, it is clear that the node on which I ran the Hive INSERT command has its CPU running at 12-16, while the other 4 nodes show 1-2. Also, all write requests are going to that same node, with none going to the other nodes.

My assumption was that Hive would distribute (push down) the work to each node, as it appears to do with the common Hadoop stacks.

Otherwise, the key is random, and the data load grows in a balanced manner across the nodes. The keyspace was created with:

CREATE KEYSPACE logs WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

Looking at the jobtracker/task details, the job is split amongst the nodes. But from the status column, it would appear that all calls to CFS are routed through the node the job was launched from.

cfs://10.0.0.21/user/hive/warehouse/event/1:2483027968+67108864
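For reference, the status string above follows Hadoop's usual split notation of path:start+length. A minimal Python sketch for pulling it apart when comparing many task rows (the function name parse_split is purely illustrative and not part of any Hadoop or DSE API):

```python
from urllib.parse import urlparse

def parse_split(status):
    """Split 'scheme://host/path:start+length' into its parts."""
    # The byte range follows the last colon in the string.
    location, _, byte_range = status.rpartition(":")
    start, _, length = byte_range.partition("+")
    url = urlparse(location)
    return {
        "host": url.hostname,
        "path": url.path,
        "start": int(start),
        "length": int(length),
    }

split = parse_split("cfs://10.0.0.21/user/hive/warehouse/event/1:2483027968+67108864")
print(split["host"])                      # 10.0.0.21
print(split["length"] // (1024 * 1024))   # 64 (MiB per split)
print(split["start"] // split["length"])  # 37 (index of this split in the file)
```

So this task reads the 64 MiB chunk at index 37 of the file, and the host part of the URI is the piece to compare across tasks when checking whether they all address the same node.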

I am hoping it is a configuration issue. I am open to other suggestions as well. But this approach is certainly impressively simple, if it can work as it does on other stacks.

Thanks for reporting this. I think it is a defect in the code, and we will fix it. We may add a configuration option so that Hive can use different connection strategies, e.g. RANDOM, ROUND_ROBIN, STICKY.
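To illustrate why the choice of connection strategy matters, here is a rough Python simulation of the three strategy names mentioned above. This is only a sketch: the semantics given to each name, the node names, and the write count are assumptions made for illustration, not DSE's actual implementation.

```python
import itertools
import random
from collections import Counter

NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical 5-node cluster

def sticky(nodes):
    """Assumed STICKY behaviour: every request goes to the same (first) node."""
    while True:
        yield nodes[0]

def round_robin(nodes):
    """Assumed ROUND_ROBIN behaviour: cycle through the nodes in order."""
    return itertools.cycle(nodes)

def random_choice(nodes, seed=0):
    """Assumed RANDOM behaviour: pick a node uniformly at random per request."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(nodes)

def simulate(strategy, writes=1000):
    """Count how many of `writes` requests each node receives."""
    picker = strategy(NODES)
    return Counter(next(picker) for _ in range(writes))

for strategy in (sticky, round_robin, random_choice):
    print(strategy.__name__, dict(simulate(strategy)))
```

Under these assumptions the sticky picker sends all 1000 writes to one node, which matches the hot spot described in the question, while round robin gives each of the five nodes exactly 200 and random spreads them roughly evenly.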

