简体   繁体   中英

Dealing with one-to-many in Cassandra

Given the below simplified domain model:

Train Journey

  • id uuid
  • train name string (unique within one day)
  • date of journey date
  • ...

Journey Stations

  • station uuid
  • arrival timestamp
  • departure timestamp
  • ...

With this relation:

  • 1 Train Journey has n Journey Stations

With this query requirements:

  • All Train Journeys at a given Station, including all Journey Stations , departing between a given timeframe.

How should this query be modelled in Cassandra?

I was thinking about something like:

CREATE TABLE departures_by_station (
    date_of_yourney date,
    train_name varchar,
    station uuid,
    arrival timestamp,
    departure timestamp,
    primary key((date_of_journey, station), train_name, departure)
);

SELECT * from departures_by_station 
WHERE date_of_journey = '2018-01-02' AND station = 'Paris' AND departure ...;

This will not work because it results in a partial Train Journey - all Journey Stations except the requested one are missing. To make it even worse, arrival & departure times may change frequently. I can't update this table with a new departure time because it is a clustering key.

Any idea or hint how to solve this problem? I guess I'm missing something basic here, but I am very new to the nosql world.

First, it maybe not the most optimal solution (need to do some calculations about partition size, etc.). If you can expire "old" data using TTL, then I thought about having something like this:

CREATE TABLE departures_by_station (
   station uuid,
   departure timestamp,
   train_name varchar,
   arrival timestamp,
   statitions list<uuid>
   primary key(station, departure, train_name)
);

In this case you'll have so-called "wide" partitions - one per station of journey, and because you have departure as clustering key, you can do a range search on it. But if you expire quite a lot of data, then you need to perform "repairing" of tables quite often to get rid of tombstones (deletions markers) because this could affect read performance.

Also, to avoid doing multiple lookups, you need to put information about all stations along journey into every row - I wrote this as list<uuid> , but it could be better modeled as user-defined type, so you can include station name, and other information.

You code also should generate an entry for every station along journey for particular train.

PS I recommend to take a DS220 (Data Modeling) course on DataStax Academy .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM