Cassandra slow performance on AWS

Question

One of our DBAs has benchmarked Cassandra to Oracle on AWS EC2 for INSERT performance (1M records) using the same Python code (below), and got the following surprising results:

Oracle 12.2 , Single node, 64cores/256GB, EC2 EBS storage, 38 sec

Cassandra 5.1.13 (DDAC), Single node , 2cores/4GB, EC2 EBS storage, 464 sec

Cassandra 3.11.4, Four nodes , 16cores/64GB(each node), EC2 EBS Storage, 486 sec

SO - What are we doing wrong?
How come Cassandra is performing so slow?
* Not enough nodes? (How come the 4 nodes is slower than single node?)
* Configuration issues?
* Something else?

Thanks!

Following is the Python code:

import logging
import time
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, BatchStatement
from cassandra.query import SimpleStatement
from cassandra.auth import PlainTextAuthProvider

class PythonCassandraExample:

    def __init__(self):
        self.cluster = None
        self.session = None
        self.keyspace = None
        self.log = None

    def __del__(self):
        self.cluster.shutdown()

    def createsession(self):
        auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
        self.cluster = Cluster(['10.220.151.138'],auth_provider = auth_provider)
        self.session = self.cluster.connect(self.keyspace)

    def getsession(self):
        return self.session

    # How about Adding some log info to see what went wrong
    def setlogger(self):
        log = logging.getLogger()
        log.setLevel('INFO')
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
        log.addHandler(handler)
        self.log = log

    # Create Keyspace based on Given Name
    def createkeyspace(self, keyspace):
        """
        :param keyspace:  The Name of Keyspace to be created
        :return:
        """
        # Before we create new lets check if exiting keyspace; we will drop that and create new
        rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
        if keyspace in [row[0] for row in rows]:
            self.log.info("dropping existing keyspace...")
            self.session.execute("DROP KEYSPACE " + keyspace)

        self.log.info("creating keyspace...")
        self.session.execute("""
                CREATE KEYSPACE %s
                WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
                """ % keyspace)

        self.log.info("setting keyspace...")
        self.session.set_keyspace(keyspace)

    def create_table(self):
        c_sql = """
                CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
                                              ename varchar,
                                              sal double,
                                              city varchar);
                 """
        self.session.execute(c_sql)
        self.log.info("Employee Table Created !!!")

    # lets do some batch insert
    def insert_data(self):
        i = 1
        while i < 1000000:
          insert_sql = self.session.prepare("INSERT INTO  employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
          batch = BatchStatement()
          batch.add(insert_sql, (i, 'Danny', 2555, 'De-vito'))
          self.session.execute(batch)
          # self.log.info('Batch Insert Completed for ' + str(i))
          i += 1

    # def select_data(self):
    #    rows = self.session.execute('select count(*) from perftest.employee limit 5;')
    #    for row in rows:
    #        print(row.ename, row.sal)

    def update_data(self):
        pass

    def delete_data(self):
        pass


if __name__ == '__main__':
    example1 = PythonCassandraExample()
    example1.createsession()
    example1.setlogger()
    example1.createkeyspace('perftest')
    example1.create_table()

    # Populate perftest.employee table
    start = time.time()
    example1.insert_data()
    end = time.time()
    print ('Duration: ' + str(end-start) + ' sec.')

    # example1.select_data()

Answer 1

There are multiple issues here:

for 2nd test you didn't allocate enough memory and cores for DDAC, so Cassandra got only 1Gb heap - Cassandra by default takes 1/4th of all available memory. The same is for 3rd test - it will get only 16Gb RAM for heap, you may need to bump it to higher value, like, 24Gb or even higher.
it's not clear how many IOPs you have in each test - EBS has different throughput depending on the size of the volume, and its type
You're using synchronous API to execute commands - basically you insert next item after you get confirmation that previous is inserted. The best throughput could be achieved by using asynchronous API ;
You're preparing your statement in every iteration - this lead to sending CQL string to server each time, so it's slows down everything - just move line insert_sql = self.session.prepare( out of the loop;
(not completely related) You're using batch statements to write data - it's anti-pattern in Cassandra , as data is sent only to one node, that then should distribute data to nodes that really own the data. This explains why 4 nodes cluster is worse than 1 node cluster.

PS realistic load testing is quite hard. There are specialized tools for this, you can find, for example, more information in this blog post .

Answer 2

The updated code below will batch every 100 records:

"""
Python  by Techfossguru
Copyright (C) 2017  Satish Prasad

"""
import logging
import warnings
import time
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, BatchStatement
from cassandra.query import SimpleStatement
from cassandra.auth import PlainTextAuthProvider

class PythonCassandraExample:

    def __init__(self):
        self.cluster = None
        self.session = None
        self.keyspace = None
        self.log = None

    def __del__(self):
        self.cluster.shutdown()

    def createsession(self):
        auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
        self.cluster = Cluster(['10.220.151.138'],auth_provider = auth_provider)
        self.session = self.cluster.connect(self.keyspace)

    def getsession(self):
        return self.session

    # How about Adding some log info to see what went wrong
    def setlogger(self):
        log = logging.getLogger()
        log.setLevel('INFO')
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
        log.addHandler(handler)
        self.log = log

    # Create Keyspace based on Given Name
    def createkeyspace(self, keyspace):
        """
        :param keyspace:  The Name of Keyspace to be created
        :return:
        """
        # Before we create new lets check if exiting keyspace; we will drop that and create new
        rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
        if keyspace in [row[0] for row in rows]:
            self.log.info("dropping existing keyspace...")
            self.session.execute("DROP KEYSPACE " + keyspace)

        self.log.info("creating keyspace...")
        self.session.execute("""
                CREATE KEYSPACE %s
                WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
                """ % keyspace)

        self.log.info("setting keyspace...")
        self.session.set_keyspace(keyspace)

    def create_table(self):
        c_sql = """
                CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
                                              ename varchar,
                                              sal double,
                                              city varchar);
                 """
        self.session.execute(c_sql)
        self.log.info("Employee Table Created !!!")

    # lets do some batch insert
    def insert_data(self):
        i = 1
        insert_sql = self.session.prepare("INSERT INTO  employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
        batch = BatchStatement()
        warnings.filterwarnings("ignore", category=FutureWarning)

        while i < 1000001:
          # insert_sql = self.session.prepare("INSERT INTO  employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
          # batch = BatchStatement()
          batch.add(insert_sql, (i, 'Danny', 2555, 'De-vito'))

          # Commit every 100 records
          if (i % 100 == 0):
             self.session.execute(batch)
             batch = BatchStatement()
             # self.log.info('Batch Insert Completed for ' + str(i))
          i += 1
        self.session.execute(batch)

    # def select_data(self):
    #    rows = self.session.execute('select count(*) from actimize.employee limit 5;')
    #    for row in rows:
    #        print(row.ename, row.sal)

    def update_data(self):
        pass

    def delete_data(self):
        pass


if __name__ == '__main__':
    example1 = PythonCassandraExample()
    example1.createsession()
    example1.setlogger()
    example1.createkeyspace('actimize')
    example1.create_table()

    # Populate actimize.employee table
    start = time.time()
    example1.insert_data()
    end = time.time()
    print ('Duration: ' + str(end-start) + ' sec.')

    # example1.select_data()

Cassandra slow performance on AWS

Question

2 answers

solution1
5 2019-05-13 14:18:46

solution2
0 2019-05-12 09:27:33

Cassandra slow performance on AWS

Question

2 answers

solution1 5 2019-05-13 14:18:46

solution2 0 2019-05-12 09:27:33

solution1
5 2019-05-13 14:18:46

solution2
0 2019-05-12 09:27:33