Cassandra (CQL) schema/tables look the same as RDBMS for my scenario

Question

I have looked at the Twissandra examples. I asked a similar question regarding this a few days back and received some tips I implemented here. However, by looking at the tables (column families) I see barely any difference between this and a relational database.

My scenario: A simple address book where a user can create his own contacts and group them (one contact can be placed in many groups, one group can contain many contacts). A contact may have multiple addresses for example.

I want to retrieve all the contacts who live in address x and are placed in group y. Therefore, I did the following:

CREATE TABLE if not exists User (user_id uuid, contact_id uuid, type varchar, email varchar, PRIMARY KEY(id));
CREATE TABLE if not exists Contact (contact_id uuid, firstname varchar,lastname varchar, photo blob, imagelength int, note varchar, PRIMARY KEY (id));
CREATE TABLE if not exists Address (address_id uuid, contact_id uuid, street varchar, number int, zipcode varchar, country varchar, PRIMARY KEY(address_id));
CREATE TABLE if not exists Group (group_id uuid, user_id, groupname varchar, PRIMARY KEY(group_id));
CREATE TABLE if not exists Group_Contact (group_id uuid, contact_id, PRIMARY KEY(id, contact_id));

However, based on this, this is literally exactly the same as a relational database, well, except that I believe Cassandra is putting this data in a different way than a RDBMS on disk. I don't see how this can be made better in Cassandra and whether I even modeled this the right way. It just feels as a plain relational database. I feel that I did something wrong since I have to use application level joins to get the address of the contacts. I really don't know how I can de-normalize this to allow multiple addresses (and maybe even phones, emails).

Any suggestions to improve this scenario would be greatly appreciated!

Answer 1

As jny indicated, data duplication, denormalization and query-based modeling are keys to building good Cassandra data models. If I wanted to take your tables above, and build a table to support address/contact queries based-on country, I could do it like this:

First, I'll create a user defined type for the contact's address.

aploetz@cqlsh:stackoverflow> CREATE TYPE contactAddress (
             ...   street varchar, 
             ...   city varchar,
             ...   zip_code varchar,
             ...   country varchar);

Next, I'll create a table called UserContactsByCountry to store user contact info, as well as any user contact addresses:

aploetz@cqlsh:stackoverflow> CREATE TABLE UserContactsByCountry (
             ...   country varchar,
             ...   user_id uuid,
             ...   type varchar,
             ...   email varchar,
             ...   firstname varchar,
             ...   lastname varchar,
             ...   photo blob,
             ...   imagelength int,
             ...   note varchar,
             ...   addresses map<text, frozen <contactAddress>>,
             ...   PRIMARY KEY ((country),user_id));

A couple of things to note here:

I am using country as a partitioning key for querying, and addding user_id as a clustering key for uniqueness.
Technically, country is being stored multiple in each row. Once as the partiiton key, and again with each address. Note that the country partition key is the one which allows us to run our query.
I assume that user contacts can have multiple addresses, so I'll store them in a map of type text (varchar), contactAddress (type created above).

Next, I'll insert three user contacts, each with two addresses, two from the USA and one from Great Britain.

aploetz@cqlsh:stackoverflow> INSERT INTO usercontactsbycountry (country, user_id, type, email, firstname, lastname, note, addresses)
VALUES ('USA',uuid(),'Tech','brycelynch@network23.com','Bryce','Lynch','Head of R&D at Network 23',{'work':{street:'101 Big Network Drive',city:'New York',zip_code:'10023',country:'USA'},'home':{street:'8192 N. 42nd St.',city:'New York',zip_code:'10025',country:'USA'}});
aploetz@cqlsh:stackoverflow> INSERT INTO usercontactsbycountry (country, user_id, type, email, firstname, lastname, note, addresses)
VALUES ('USA',uuid(),'Reporter','edisoncarter@network23.com','Edison','Carter','Reporter at Network 23',{'work':{street:'101 Big Network Drive',city:'New York',zip_code:'10023',country:'USA'},'home':{street:'76534 N. 62nd St.',city:'New York',zip_code:'10024',country:'USA'}});
aploetz@cqlsh:stackoverflow> INSERT INTO usercontactsbycountry (country, user_id, type, email, firstname, lastname, note, addresses)
VALUES ('GBR',uuid(),'Reporter','theorajones@network23.com','Theora','Jones','Controller at Network 23',{'work':{street:'101 Big Network Drive',city:'New York',zip_code:'10023',country:'USA'},'home':{street:'821 Wembley St.',city:'London',zip_code:'W11 2BQ',country:'GBR'}});

Now I can query that table for all user contacts in the USA:

aploetz@cqlsh:stackoverflow> SELECT * FROM usercontactsbycountry WHERE country ='USA';
 country | user_id                              | addresses                                                                                                                                                                                    | email                      | firstname | imagelength | lastname | note                      | photo | type
---------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+-------------+----------+---------------------------+-------+----------
     USA | 2dee94e2-4887-4988-8cf5-9aee5fd0ea1e |  {'home': {street: '8192 N. 42nd St.', city: 'New York', zip_code: '10025', country: 'USA'}, 'work': {street: '101 Big Network Drive', city: 'New York', zip_code: '10023', country: 'USA'}} |   brycelynch@network23.com |     Bryce |        null |    Lynch | Head of R&D at Network 23 |  null |     Tech
     USA | b92612dd-dbaa-42f2-8ff2-d36b6c601aeb | {'home': {street: '76534 N. 62nd St.', city: 'New York', zip_code: '10024', country: 'USA'}, 'work': {street: '101 Big Network Drive', city: 'New York', zip_code: '10023', country: 'USA'}} | edisoncarter@network23.com |    Edison |        null |   Carter |    Reporter at Network 23 |  null | Reporter

(2 rows)

There are probably other ways in which this could be modeled, but this is one that I hoped to use to help you understand some of the techniques available.

Answer 2

It is difficult to switch from modeling for relational databases to modeling for Cassandra, because they seem so similar: the query language looks almost the same. But the first rule of Cassandra is model to your queries while in Relational Databases we model to data. That means:

Consider what your query the most on
Learn about partition keys and cluster keys
Don't be afraid of data duplication

There is a good example on data modeling in Cassandra: https://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_music_service_c.html

Cassandra (CQL) schema/tables look the same as RDBMS for my scenario

Question

2 answers

solution1
3 ACCPTED 2015-03-24 18:55:19

solution2
2 2015-03-24 14:15:09

Cassandra (CQL) schema/tables look the same as RDBMS for my scenario

Question

2 answers

solution1 3 ACCPTED 2015-03-24 18:55:19

solution2 2 2015-03-24 14:15:09

solution1
3 ACCPTED 2015-03-24 18:55:19

solution2
2 2015-03-24 14:15:09