
Azure table storage querying partitionkey

I am using Azure table storage and retrieving data with a timestamp filter. Execution is very slow because the timestamp is not the partition key or row key. From research on Stack Overflow I found that the timestamp should be converted to ticks and stored in the partition key. I did that: while inserting data I built the string below and used the tick string as the partition key.

string currentDateTimeTick = ConvertDateTimeToTicks(DateTime.Now.ToUniversalTime()).ToString();

public static long ConvertDateTimeToTicks(DateTime dtInput)
{
    long ticks = 0;
    ticks = dtInput.Ticks;
    return ticks;
}

This works fine up to here. But when I try to retrieve the last 5 days of data, I am unable to query the ticks against the partition key. What is my mistake in the code below?

int days = 5;
TableQuery<MyEntity> query = new TableQuery<MyEntity>()
.Where(TableQuery.GenerateFilterConditionForDate("PartitionKey", QueryComparisons.GreaterThanOrEqual, "0"+DateTimeOffset.Now.AddDays(days).Date.Ticks));

Are you sure you want to use ticks as a partition key? This means that every measurable 100 ns instant becomes its own partition. With time-based data you can use the partition key to specify an interval such as an hour, a minute or even a second, and then use a row key with the actual timestamp.
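As a sketch of that idea (the variable names here are illustrative, not from the question): partition on the hour, and keep the full-resolution timestamp in the row key:

```csharp
using System;

class IntervalKeys
{
    static void Main()
    {
        var now = DateTime.UtcNow;

        // Partition key: the timestamp truncated to the hour, zero-padded
        // so that string comparison matches chronological order.
        var hour = new DateTime(now.Year, now.Month, now.Day, now.Hour, 0, 0, DateTimeKind.Utc);
        string partitionKey = hour.Ticks.ToString("D18");

        // Row key: the full-resolution timestamp within that hour.
        string rowKey = now.Ticks.ToString("D18");

        Console.WriteLine(partitionKey);
        Console.WriteLine(rowKey);
    }
}
```

With this layout, all entities from the same hour land in one partition, and within the partition they are ordered by the exact instant.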

That problem aside, let me show you how to do the query. First, a comment on how you generate the partition key. I suggest you do it like this:

var partitionKey = DateTime.UtcNow.Ticks.ToString("D18");

Don't use DateTime.Now.ToUniversalTime() to get the current UTC time. Internally it uses DateTime.UtcNow, converts it to the local time zone, and ToUniversalTime() then converts it back to UTC, which is just wasteful (and more time-consuming than you may think).

And your ConvertDateTimeToTicks() method serves no purpose other than reading the Ticks property, so it just makes your code more complex without adding any value.

Here is how to perform the query:

var days = 5;
var partitionKey = DateTime.UtcNow.AddDays(-days).Ticks.ToString("D18");
var query = new TableQuery<MyEntity>().Where(
  TableQuery.GenerateFilterCondition(
    "PartitionKey",
    QueryComparisons.GreaterThanOrEqual,
    partitionKey
  )
);

The partition key is formatted as an 18-character string, allowing you to use a straightforward string comparison.
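To see why the fixed-width zero padding matters (the small numbers here are made up for illustration): without it, the lexicographic order of the strings would not match the numeric order of the ticks:

```csharp
using System;

class PaddingDemo
{
    static void Main()
    {
        // Unpadded: string order disagrees with numeric order,
        // because comparison is character by character.
        Console.WriteLine(string.CompareOrdinal("99", "1234") > 0);  // True: "99" sorts after "1234"

        // Zero-padded to a fixed width: string order matches numeric order.
        string a = 99.ToString("D18");    // "000000000000000099"
        string b = 1234.ToString("D18");  // "000000000000001234"
        Console.WriteLine(string.CompareOrdinal(a, b) < 0);          // True
    }
}
```

Table storage compares PartitionKey values as strings, so the padding is what makes GreaterThanOrEqual behave like a numeric comparison on ticks.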

I suggest that you move the code that generates the partition key (and row key) into a function, to make sure the keys are generated the same way throughout your code.
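A minimal sketch of such a helper (the class and method names are my own, not from the answer):

```csharp
using System;

public static class EntityKeys
{
    // Partition key: the instant's Ticks value, zero-padded to 18 digits
    // so that string comparison matches chronological order.
    public static string ToPartitionKey(DateTime utcInstant) =>
        utcInstant.Ticks.ToString("D18");

    public static string CurrentPartitionKey() =>
        ToPartitionKey(DateTime.UtcNow);
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(EntityKeys.CurrentPartitionKey());
    }
}
```

Every insert and every query then goes through the same function, so a later change (say, truncating to the hour) only has to be made in one place.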

The reason 18 characters are used is that the Ticks value of a DateTime, both today and for many thousands of years into the future, has 18 decimal digits. If you decide to base your partition key on hours, minutes or seconds instead of 100 ns ticks, you can shorten the length of the partition key accordingly.

As Martin suggests, using a timestamp as your partition key is almost certainly not what you want to do.

Partitions are the unit of scale in Azure Table Storage and more or less represent physical segmentation of your data. They are a scalability optimization that allows you to "throw hardware" at the problem of storing more and more data while maintaining acceptable response times (something that is traditionally hard in data storage). You define the partitions in your data by assigning partition keys to rows. It is almost never desirable for each row to live in its own partition.

In ATS, the row key becomes your unique key within a given partition. So the combination of partition key + row key is the true unique key across the entire ATS table.
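Because that combination is unique, a lookup by both keys is the cheapest query the service offers. A fragment using the classic WindowsAzure.Storage client (here `table`, `partitionKey` and `rowKey` are assumed to already exist):

```csharp
// Point query: partition key + row key identify exactly one entity,
// so the service can answer without scanning.
TableOperation retrieve = TableOperation.Retrieve<MyEntity>(partitionKey, rowKey);
TableResult result = table.Execute(retrieve);
var entity = (MyEntity)result.Result;  // null if no entity matched
```

Any query that supplies only the row key, by contrast, has to be evaluated in every partition.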

There is lots of advice out there for choosing a valid partition key and row key... none of which is generalized. It depends on the nature of your data, your anticipated query patterns, and so on.

Choose a partition key that will aggregate your data into a reasonably well-distributed set of "buckets". All things being equal, if you anticipate having 1 million rows in your table, it is often useful to have, say, 10 buckets with 100,000 rows each... or maybe 100 buckets with 10,000 rows each. At query time you will need to pick the partition(s) you are querying, so the number of buckets may matter to you. "Buckets" often correspond to a natural segmentation concept in your domain... a bucket to represent each US state, or a bucket for each department in your company, etc. Note that it is not necessary (or often possible) to have perfectly distributed buckets... get as close as you can, with reasonable effort.

One example of where you might intentionally have an uneven distribution is if you intend to vary query patterns by bucket... bucket A will receive lots of cheap, fast queries, bucket B fewer, more expensive queries, and so on. Or perhaps bucket A's data will remain static while bucket B's data changes frequently. This can be accomplished with multiple tables, too... so there is no "one size fits all" answer.

Given the limited knowledge we have of your problem, I like Martin's advice of using a time span as your partition key. Small spans will result in many partitions, and (among other things) make queries that utilize multiple time spans relatively expensive. Larger spans will reduce aggregation costs across spans, but will result in bigger partitions and thus more expensive queries within a partition (and will make identifying a suitable row key potentially more challenging).

Ultimately you will likely need to experiment with a few options to find the one most suitable for your data and intended queries.

One other piece of advice... don't be afraid to consider duplicating data in multiple data stores to suit widely varying query types. Not every query will work effectively against a single schema or storage configuration. The effort needed to synchronize data across stores may be less than that needed to bend query technology X to your will.

More on partition and row key choices

Also here

Best of luck!

One thing that was not mentioned in the answers above is that Azure will detect whether you are using sequential, always-increasing or always-decreasing values for your partition key, and create "range partitions". Range partitions group entities that have sequential, unique PartitionKey values, to improve the performance of range queries. Without range partitions, as mentioned above, a range query will need to cross partition boundaries or server boundaries, which can decrease query performance. Range partitions happen under the hood and are decided by Azure, not by you.

Now, if you want to do bulk inserts, say once a minute, you will still need to flatten your timestamp partition keys to, say, ticks rounded to the nearest minute. You can only do bulk inserts with the same partition key.
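A sketch of such a batch with the classic WindowsAzure.Storage client (`table` and `entities` are assumed to already exist, and every entity in `entities` must carry the same minute-rounded PartitionKey):

```csharp
// All entities rounded to the same minute share one partition key,
// so they can be written in a single batch (up to 100 operations,
// 4 MB total, per batch).
var batch = new TableBatchOperation();
foreach (var entity in entities)
    batch.Insert(entity);
table.ExecuteBatch(batch);
```

If a minute's worth of data exceeds the batch limits, split it into multiple batches with the same partition key.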
