How should I store a date interval in Cassandra?

I'm working on an application that stores sensor measurements. Sometimes, the sensors will send erroneous measurements (e.g. the measured value is out of bounds). We do not want to persist each measurement error separately, but we do want to persist statistics about these errors, such as the sensor id, the date of the first error, the date of the last error, and other information like the number of successive errors, which I'll omit here...

Here is a simplified version of the "ErrorStatistic" class:

package foo.bar.repository;

import org.joda.time.DateTime;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;

import static com.google.common.base.Preconditions.checkNotNull;

public class ErrorStatistic {

    @Nonnull
    private final String sensorId;
    @Nonnull
    private final DateTime startDate;
    @Nullable
    private DateTime endDate;

    public ErrorStatistic(@Nonnull String sensorId, @Nonnull DateTime startDate) {
        this.sensorId = checkNotNull(sensorId);
        this.startDate = checkNotNull(startDate);
        this.endDate = null;
    }

    @Nonnull
    public String getSensorId() {
        return sensorId;
    }

    @Nonnull
    public DateTime getStartDate() {
        return startDate;
    }

    @Nullable
    public DateTime getEndDate() {
        return endDate;
    }

    public void setEndDate(@Nonnull DateTime endDate) {
        this.endDate = checkNotNull(endDate);
    }

}

I am currently persisting these ErrorStatistic objects using Hector as follows:

private void persistErrorStatistic(ErrorStatistic errorStatistic) {
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

    // One row per sensor; one column per error interval, named after its start date,
    // holding the serialized ErrorStatistic as the column value.
    String rowKey = errorStatistic.getSensorId();
    String columnName = errorStatistic.getStartDate().toString(YYYY_MM_DD_FORMATTER);
    byte[] value = serialize(errorStatistic);

    HColumn<String, byte[]> column = HFactory.createColumn(columnName, value, StringSerializer.get(), BytesArraySerializer.get());
    mutator.addInsertion(rowKey, COLUMN_FAMILY, column);

    mutator.execute();
}

private static final DateTimeFormatter YYYY_MM_DD_FORMATTER = DateTimeFormat.forPattern("yyyy-MM-dd");

When we receive the first measurement in error, we create an ErrorStatistic with sensorId and startDate set, and a null endDate. This ErrorStatistic is kept in our in-memory model, and persisted in Cassandra. We then update the ErrorStatistic in memory for the next measurements in error, until we receive a valid measurement, at which point the ErrorStatistic is persisted and removed from our in-memory model.
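
A minimal sketch of that flow, reusing the persistErrorStatistic method above; the handleMeasurement entry point, the isErroneous validity check and the currentErrors map are hypothetical names used only for illustration, not taken from our actual code:

// Hypothetical in-memory model: the currently open ErrorStatistic per sensor.
private final Map<String, ErrorStatistic> currentErrors = new HashMap<String, ErrorStatistic>();

private void handleMeasurement(String sensorId, DateTime timestamp, double value) {
    if (isErroneous(value)) {
        ErrorStatistic stat = currentErrors.get(sensorId);
        if (stat == null) {
            // First error after a valid measurement: open a new interval and persist it.
            stat = new ErrorStatistic(sensorId, timestamp);
            currentErrors.put(sensorId, stat);
            persistErrorStatistic(stat);
        }
        // Subsequent errors only update the in-memory statistic
        // (number of successive errors, etc.).
    } else {
        ErrorStatistic stat = currentErrors.remove(sensorId);
        if (stat != null) {
            // Valid measurement received: close the interval and persist its final state.
            stat.setEndDate(timestamp);
            persistErrorStatistic(stat);
        }
    }
}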

Cassandra thus contains ErrorStatistics with open-ended intervals (e.g. [2012-08-01T00:00Z|null]) and closed intervals (e.g. [2012-08-01T00:00Z|2013-01-12T10:23Z]).

I want to be able to query these ErrorStatistics by date.

For example, if I have these 3 error statistics:

sensorId  = foo
startDate = 2012-08-01T00:00Z
endDate   = 2012-09-03T02:10Z

sensorId  = foo
startDate = 2012-10-04T03:12Z
endDate   = 2013-02-01T12:28Z

sensorId  = foo
startDate = 2013-03-05T23:22Z
endDate   = null
(this means we have not received a valid measurement since 2013-03-05)

If I query Cassandra with the date:

  • 2012-08-04T10:00Z --> it should return the first ErrorStatistic
  • 2012-09-04T00:00Z --> it should return that there were no errors at this time
  • 2014-01-03T00:00Z --> it should return the last ErrorStatistic (since it is open-ended)

I am not sure how I should store and "index" these ErrorStatistic objects in order to query them efficiently. I am quite new to Cassandra, and I might be missing something obvious.


Edit: the following was added in response to Joost's suggestion that I should focus on the type of queries I am interested in.

I will have two types of query:

  • The first, as you guessed, is to list all ErrorStatistics for a given sensor and time range. This seems relatively easy. The only problem I will have is when an ErrorStatistic starts before the time range I'm interested in (e.g. I query all errors for the month of April, and I want my query to return the ErrorStatistic [2012-03-29:2012-04-02] too...)
  • The second query seems harder. I want to find, for a given sensor and date, the ErrorStatistic whose interval contains the given date, or whose startDate precedes the given date with a null endDate (this means that we are still receiving errors for this sensor). I don't know how to do this efficiently. I could just load up all ErrorStatistics for the given sensor, then check the intervals in Java... But I'd like to avoid this if possible. I guess I want Cassandra to start at a given date and look backward until it finds the first ErrorStatistic with a startDate that precedes the given date (if any), then load it and check in Java whether its endDate is null or after the given date. But I have no idea if that's possible, and how efficient that would be (see the sketch after this list).
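
For reference, here is a rough sketch of that backward lookup with Hector, reusing the column layout from the persist code above (one column per error interval, named after its start date, under the sensor's row key, assuming the column family's comparator sorts the date strings chronologically). The findErrorStatisticAt name and the deserialize helper (the counterpart of the serialize call above) are assumptions, not part of my actual code:

private ErrorStatistic findErrorStatisticAt(String sensorId, DateTime date) {
    // Walk the columns backwards from the given date: with reversed = true and
    // count = 1, the slice returns the column whose name (formatted start date)
    // is the largest one that is <= the query date.
    SliceQuery<String, String, byte[]> query = HFactory.createSliceQuery(
            keyspace, StringSerializer.get(), StringSerializer.get(), BytesArraySerializer.get());
    query.setColumnFamily(COLUMN_FAMILY);
    query.setKey(sensorId);
    query.setRange(date.toString(YYYY_MM_DD_FORMATTER), null, true, 1);

    List<HColumn<String, byte[]>> columns = query.execute().get().getColumns();
    if (columns.isEmpty()) {
        return null; // no interval starts on or before the given date
    }
    ErrorStatistic candidate = deserialize(columns.get(0).getValue());
    // The interval contains the date if it is still open (endDate is null)
    // or if it ends after the queried date.
    if (candidate.getEndDate() == null || candidate.getEndDate().isAfter(date)) {
        return candidate;
    }
    return null;
}

The same backward lookup would also cover the boundary case of the first query: fetch one extra column before the start of the range and include it if its interval extends into the range.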

The question you have to ask yourself is what questions you want to ask of the ErrorStatistics. Cassandra schema design typically starts with a 'table per query' approach: don't start with the data (entities) you have, but with your questions/queries. This is a different mindset than 'traditional' RDBMS design, and I found it takes some time to get used to.

For example, do you want to query the statistics per sensor? Then a table with a composite key (sensor id, timeuuid) could be a solution. Such a table allows for quick lookups per sensor id, sorting the results based on time.

If you want to query the sensor statistics based on time only, a (composite) key with a time unit may be of more help, possibly with sharding elements to better distribute the load over the nodes. Note that there is a catch: range queries over partition keys are not feasible using the Cassandra random or murmur partitioners. There are other partitioners, but they easily lead to uneven load distribution in your cluster.
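
As a purely hypothetical illustration of such a time-bucketed key, expressed against the Hector snippets in the question (the yyyy-MM bucket and the key layout are assumptions, not a prescription):

// Hypothetical: prefix the row key with a month bucket so that rows for
// different months end up on different nodes.
String monthBucket = errorStatistic.getStartDate().toString(DateTimeFormat.forPattern("yyyy-MM"));
String rowKey = monthBucket + ":" + errorStatistic.getSensorId();
// A time-range query then addresses each month bucket explicitly, instead of
// relying on a range scan over row keys, which the random/murmur partitioners
// do not return in a meaningful order.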

In short, start with the answers you want, and then work 'backwards' to your table design. With a proper schema, your code will follow.


Addition (2013-9-5): What is good to know is that Cassandra sorts data within the scope of a single partition key. That is something very useful. For example, the measurements would be ordered by start_date in descending order (newest first) if you define the table as:

create table SensorByDate
(
    sensor_id uuid,
    start_date timestamp,
    end_date timestamp,
    measurement int,
    primary key (sensor_id, start_date)
)
with clustering order by (start_date DESC);

In this example the sensor_id is the partition key and determines the node this row is stored on. The start_date is the second item in the composite key and determines the sort order.

To get the most recent measurement starting before a given date in this table (the backward lookup described above), you could formulate a query like:

select * from SensorByDate 
where sensor_id = ? and start_date < ? limit 1
