
Cassandra query on secondary index is very slow

We have a table with about 40k rows, and querying it on a secondary index is slow (30 seconds on production). Our Cassandra version is 1.2.8. The table schema is as follows:

CREATE TABLE usertask (
  tid uuid PRIMARY KEY,
  content text,
  ts int
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

CREATE INDEX usertask_ts_idx ON usertask (ts);

When I turn on tracing, I notice a lot of lines like the following:

Executing single-partition query on usertask.usertask_ts_idx

With only 40k rows, there appear to be several thousand queries against usertask_ts_idx. What could be the problem? Thanks.

More investigation

I tried the same query on our test server, and it is much faster (30 seconds on production, 1-2 seconds on the test server). Comparing the tracing logs, the difference is the time spent seeking to the partition indexed section in the data file. On production each seek takes 1000-3000 microseconds; on the dev server it takes 100 microseconds. My guess is that our production server does not have enough memory to cache the data file, so seeking in the data file is slow.
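As a rough sanity check on those numbers, here is a small Python sketch that multiplies a seek count by the per-seek latency. The per-seek figures are the ones from the trace above; the seek count of 10,000 is purely an assumption (the trace only shows "thousands" of index queries), chosen because it happens to line up with the observed totals.

```python
# Back-of-the-envelope estimate: total time spent on seeks.
# ASSUMED_SEEKS is hypothetical; the real count was not measured.

def total_seek_seconds(num_seeks: int, micros_per_seek: float) -> float:
    """Return the total seek time in seconds."""
    return num_seeks * micros_per_seek / 1_000_000


ASSUMED_SEEKS = 10_000  # assumption, not taken from the trace

if __name__ == "__main__":
    # Production: 1000-3000 microseconds per seek (from the trace).
    print("prod, low :", total_seek_seconds(ASSUMED_SEEKS, 1000), "s")
    print("prod, high:", total_seek_seconds(ASSUMED_SEEKS, 3000), "s")
    # Dev/test: 100 microseconds per seek (from the trace).
    print("dev       :", total_seek_seconds(ASSUMED_SEEKS, 100), "s")
```

Under that assumption the estimate lands at 10-30 seconds for production and about 1 second for the dev server, which is in the same range as the observed 30 seconds and 1-2 seconds. So per-seek latency alone could plausibly account for the gap.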

I am presuming ts is a timestamp, in which case it is not a good candidate for a secondary index. The reason is that it is a high-cardinality value (i.e. almost all values are unique). This means you end up with almost one row in the index for each row in usertask, so the query effectively becomes a join operation. Joins are terribly slow on a distributed database. Since you haven't shown your query, I'm not sure exactly what you're doing, but you'll need to rethink your data model if you want to query based on time.
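One common way to remodel for time-based queries is to bucket rows by day and make the bucket the partition key, so a time range maps to a handful of partitions instead of thousands of index lookups. The Python sketch below is only an illustration under assumptions: it assumes ts holds Unix epoch seconds, and the table name usertask_by_day, the day column, and the helper functions are hypothetical, not something from the question.

```python
from typing import List

SECONDS_PER_DAY = 86_400

# Hypothetical replacement table: one partition per day, rows ordered by ts.
CREATE_TABLE_CQL = """
CREATE TABLE usertask_by_day (
  day int,
  ts int,
  tid uuid,
  content text,
  PRIMARY KEY (day, ts, tid)
);
"""

# Hypothetical query, run once per day bucket that the range touches.
SELECT_CQL = (
    "SELECT tid, ts, content FROM usertask_by_day "
    "WHERE day = ? AND ts >= ? AND ts < ?"
)


def day_bucket(ts: int) -> int:
    """Map a timestamp (assumed to be epoch seconds) to its day bucket."""
    return ts // SECONDS_PER_DAY


def buckets_for_range(start_ts: int, end_ts: int) -> List[int]:
    """List the day buckets covered by the half-open range [start_ts, end_ts)."""
    if end_ts <= start_ts:
        return []
    return list(range(day_bucket(start_ts), day_bucket(end_ts - 1) + 1))


if __name__ == "__main__":
    # A range that starts late on day 0 and ends early on day 2.
    for day in buckets_for_range(86_000, 180_000):
        print("would run:", SELECT_CQL, "with day =", day)
```

Each bucket becomes one single-partition read, so the number of queries grows with the width of the time range rather than with the number of matching rows. Whether a day is the right bucket size depends on how many tasks are written per day, which the question does not say.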
