How should I store a date interval in Cassandra?
I'm working on an application that stores sensor measurements. Sometimes, the sensors will send erroneous measurements (e.g. the measured value is out of bounds). We do not want to persist each measurement error separately, but we want to persist statistics about these errors, such as the sensor id, the date of the first error, the date of the last error, and other information like the number of successive errors, which I'll omit here...
Here is a simplified version of the "ErrorStatistic" class:
package foo.bar.repository;

import org.joda.time.DateTime;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;

import static com.google.common.base.Preconditions.checkNotNull;

public class ErrorStatistic {

    @Nonnull
    private final String sensorId;
    @Nonnull
    private final DateTime startDate;
    @Nullable
    private DateTime endDate;

    public ErrorStatistic(@Nonnull String sensorId, @Nonnull DateTime startDate) {
        this.sensorId = checkNotNull(sensorId);
        this.startDate = checkNotNull(startDate);
        this.endDate = null;
    }

    @Nonnull
    public String getSensorId() {
        return sensorId;
    }

    @Nonnull
    public DateTime getStartDate() {
        return startDate;
    }

    @Nullable
    public DateTime getEndDate() {
        return endDate;
    }

    public void setEndDate(@Nonnull DateTime endDate) {
        this.endDate = checkNotNull(endDate);
    }
}
I am currently persisting these ErrorStatistics using Hector as follows:
private void persistErrorStatistic(ErrorStatistic errorStatistic) {
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    String rowKey = errorStatistic.getSensorId();
    String columnName = errorStatistic.getStartDate().toString(YYYY_MM_DD_FORMATTER);
    byte[] value = serialize(errorStatistic);
    HColumn<String, byte[]> column = HFactory.createColumn(
            columnName, value, StringSerializer.get(), BytesArraySerializer.get());
    mutator.addInsertion(rowKey, COLUMN_FAMILY, column);
    mutator.execute();
}

private static final DateTimeFormatter YYYY_MM_DD_FORMATTER = DateTimeFormat.forPattern("yyyy-MM-dd");
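The serialize(errorStatistic) helper is not shown here. As an assumption on my part, a minimal sketch using plain Java serialization could look like the following (for this to work, ErrorStatistic would have to implement Serializable; the class and method names are mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical implementation of the serialize() helper used above,
// based on plain Java serialization.
public class SerializeSketch {

    static byte[] serialize(Serializable value) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Round-trip a simple Serializable value to show the bytes are valid.
        byte[] data = serialize("sensor-42");
        try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(data))) {
            System.out.println(in.readObject()); // prints sensor-42
        }
    }
}
```

A dedicated serializer (e.g. one of Hector's, or a hand-rolled binary format) would avoid the versioning pitfalls of Java serialization, but the sketch above matches the byte[] value the mutator expects.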
When we receive the first measurement in error, we create an ErrorStatistic with sensorId and startDate set, and a null endDate. This ErrorStatistic is kept in our in-memory model and persisted in Cassandra. We then update the in-memory ErrorStatistic for each subsequent erroneous measurement, until we receive a valid measurement, at which point the ErrorStatistic is persisted and removed from our in-memory model.
Cassandra thus contains ErrorStatistics with open-ended intervals (e.g. [2012-08-01T00:00Z|null]) and closed intervals (e.g. [2012-08-01T00:00Z|2013-01-12T10:23Z]).
I want to be able to query these ErrorStatistics by date.
For example, if I have these 3 error statistics:
sensorId = foo
startDate = 2012-08-01T00:00Z
endDate = 2012-09-03T02:10Z

sensorId = foo
startDate = 2012-10-04T03:12Z
endDate = 2013-02-01T12:28Z

sensorId = foo
startDate = 2013-03-05T23:22Z
endDate = null
(this means we have not received a valid measurement since 2013-03-05)
If I query Cassandra with a given date, I would like to find the ErrorStatistics whose interval contains that date. I am not sure how I should store and "index" these ErrorStatistic objects to query them efficiently. I am quite new to Cassandra, and I might be missing something obvious.
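The containment test itself is simple on the Java side; the hard part is finding candidate rows in Cassandra. As a minimal sketch of the check (using java.time instead of the Joda-Time the question uses; class and method names are mine), where a null end date marks a still-open interval:

```java
import java.time.Instant;

// Sketch: does the error interval [start, end) contain the given date?
// A null end means the interval is still open (errors are still arriving).
public class IntervalCheck {

    static boolean contains(Instant start, Instant end, Instant date) {
        // The interval contains the date if it began on or before the date
        // and has either not ended yet or ends after the date.
        return !start.isAfter(date) && (end == null || end.isAfter(date));
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2012-08-01T00:00:00Z");
        Instant end = Instant.parse("2012-09-03T02:10:00Z");

        System.out.println(contains(start, end, Instant.parse("2012-08-15T00:00:00Z")));  // true
        System.out.println(contains(start, null, Instant.parse("2014-01-01T00:00:00Z"))); // true (open interval)
        System.out.println(contains(start, end, Instant.parse("2013-01-01T00:00:00Z")));  // false (already closed)
    }
}
```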
Edit: the following was added in response to Joost's suggestion that I should focus on the type of queries I am interested in.
I will have two types of query:
1. find the ErrorStatistics whose interval contains a given date;
2. find the ErrorStatistics whose startDate precedes the given date, with a null endDate (this means that we are still receiving errors for this sensor).

I don't know how to do this efficiently. One idea would be to find the last ErrorStatistic with a startDate that precedes the given date (if any), then load it and check in Java if its endDate is null or after the given date.
But I have no idea if that's possible, and how efficient that would be.

The question you have to ask yourself is what questions you have towards the ErrorStatistics. Cassandra schema design typically starts with a 'Table per query' approach.
Don't start with the data (entities) you have, but with your questions/queries. This is a different mindset than 'traditional' RDBMS design, and I found it takes some time to get used to.
For example, do you want to query the statistics per sensor? Then a table with a composite key (sensor id, timeuuid) could be a solution. Such a table allows for quick lookup per sensor id, sorting the results based on time.
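A sketch of such a table could look like the following (the table and column names, and the blob payload, are illustrative, not from the answer):

```sql
-- Hypothetical layout: one partition per sensor, rows clustered by a timeuuid,
-- so all statistics for a sensor live together and are sorted by time.
create table error_statistics_by_sensor (
    sensor_id text,
    event_time timeuuid,
    payload blob,
    primary key (sensor_id, event_time)
);
```

Here sensor_id is the partition key (fast lookup of one sensor's rows) and event_time is the clustering column (time-ordered within the partition).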
If you want to query the sensor statistics based on time only, a (composite) key with a time unit may be of more help, possibly with sharding elements to better distribute the load over the nodes. Note that there is a catch: range queries on partition keys are not feasible using the Cassandra random or Murmur3 partitioners. There are other partitioners, but they easily lead to uneven load distribution in your cluster.
In short, start with the answers you want, and then work 'backwards' to your table design. With a proper schema, your code will follow.
Addition (2013-09-05): What is good to know is that Cassandra sorts data within the scope of a single partition key. That is something very useful. For example, the measurements would be ordered by start_date in descending order (newest first) if you define the table as:
create table SensorByDate
(
    sensor_id uuid,
    start_date timestamp,
    end_date timestamp,
    measurement int,
    primary key (sensor_id, start_date)
)
with clustering order by (start_date DESC);
In this example the sensor_id is the partition key and determines the node this row is stored on. The start_date is the second item in the composite key and determines the sort order.
To get the most recent measurement that started before a given date in this table, you could formulate a query like:
select * from SensorByDate
where sensor_id = ? and start_date < ? limit 1
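For instance, a concrete invocation with the placeholders bound (the uuid and date values below are illustrative) could look like:

```sql
-- Most recent row for this sensor whose start_date precedes the given date.
-- With the DESC clustering order, "limit 1" returns the latest matching row.
select * from SensorByDate
where sensor_id = 123e4567-e89b-12d3-a456-426655440000
  and start_date < '2013-01-01 00:00:00+0000'
limit 1;
```

This maps directly onto the question's idea: fetch the last ErrorStatistic starting before the date, then check its end date on the client side.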