Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)
I am doing read and update queries on a table having 500000 rows, and sometimes I get the error below after processing around 300000 rows, even when no node is down.
Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)
Infrastructure details:
We have 5 Cassandra nodes, 5 Spark nodes, and 3 Hadoop nodes, each with 8 cores and 28 GB of memory; the Cassandra replication factor is 3.
Cassandra 2.1.8.621 | DSE 4.7.1 | Spark 1.2.1 | Hadoop 2.7.1.
Cassandra configuration:
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
I have tried the same job after increasing read_request_timeout_in_ms to 20,000 as well, but it didn't help.
I am running queries on two tables. Below is the CREATE statement for one of the tables:
Create Table:
CREATE TABLE section_ks.testproblem_section (
problem_uuid text PRIMARY KEY,
documentation_date timestamp,
mapped_code_system text,
mapped_problem_code text,
mapped_problem_text text,
mapped_problem_type_code text,
mapped_problem_type_text text,
negation_ind text,
patient_id text,
practice_uid text,
problem_category text,
problem_code text,
problem_comment text,
problem_health_status_code text,
problem_health_status_text text,
problem_onset_date timestamp,
problem_resolution_date timestamp,
problem_status_code text,
problem_status_text text,
problem_text text,
problem_type_code text,
problem_type_text text,
target_site_code text,
target_site_text text
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Queries:
1) SELECT encounter_uuid, encounter_start_date FROM section_ks.encounters WHERE patient_id = '1234' AND encounter_start_date >= '" + formatted_documentation_date + "' ALLOW FILTERING;
2) UPDATE section_ks.encounters SET testproblem_uuid_set = testproblem_uuid_set + {'1256'} WHERE encounter_uuid = 'abcd345';
Usually when you get a timeout error it means you are trying to do something that isn't scaling well in Cassandra. The fix is often to modify your schema.
I suggest you monitor the nodes while running your query to see if you can spot the problem area. For example, you can run "watch -n 1 nodetool tpstats" to see if any queues are backing up or dropping items. See other monitoring suggestions here.
One thing that might be off in your configuration is that you say you have five Cassandra nodes, but only three Spark workers (or are you saying you have three Spark workers on each Cassandra node?). You'll want at least one Spark worker on each Cassandra node so that loading data into Spark is done locally on each node and not over the network.
It's hard to tell much more than that without seeing your schema and the query you are running. Are you reading from a single partition? I started getting timeout errors in the vicinity of 300,000 rows when reading from a single partition. See the question here. The only workaround I have found so far is to use a client-side hash in my partition key to break the partitions up into smaller chunks of around 100K rows. So far I have not found a way to tell Cassandra not to time out for a query that I expect to take a long time.
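The client-side hash workaround can be sketched roughly as follows (a minimal Python illustration; the bucket count, the helper name, and the extra `bucket` column are my own assumptions, not part of the original schema):

```python
import hashlib

NUM_BUCKETS = 8  # illustrative: pick so each bucket stays around ~100K rows

def bucket_for(natural_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a natural key to a small bucket number.

    The bucket would become part of the partition key, e.g.
    PRIMARY KEY ((patient_id, bucket), encounter_start_date),
    splitting one huge partition into num_buckets smaller ones.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

# Writes compute the bucket from the natural key; reads fan out over all
# buckets, running the range query once per bucket and merging the results.
```

Reads then issue one query per bucket (WHERE patient_id = ? AND bucket = ?) and merge the results client-side, so no single read touches the whole oversized partition.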
I don't think configuration is the root cause; this looks like a data model issue.
It would be helpful to see the structure of the section_ks.encounters table.
I'd suggest thinking carefully about which concrete queries you expect to run before designing the table structure(s).
As far as I can see, those two queries expect different structures of section_ks.encounters in order to run with good performance.
Let's review each provided query and try to design tables:
First one:
SELECT encounter_uuid, encounter_start_date FROM section_ks.encounters WHERE patient_id = '1234' AND encounter_start_date >= '" + formatted_documentation_date + "' ALLOW FILTERING;
Here is an example of a table structure that fits the given query effectively:
create table section_ks.encounters(
patient_id text,
encounter_start_date timestamp,
encounter_uuid text,
some_other_non_unique_column text,
PRIMARY KEY (patient_id, encounter_start_date)
);
ALLOW FILTERING can now be removed from the query:
SELECT encounter_uuid, encounter_start_date
FROM section_ks.encounters
WHERE patient_id = '1234' AND encounter_start_date >= '2017-08-19';
Second query:
UPDATE section_ks.encounters SET testproblem_uuid_set = testproblem_uuid_set + {'1256'} WHERE encounter_uuid = 'abcd345';
The table structure should look something like:
create table section_ks.encounters(
encounter_uuid text, -- partition key
patient_id text,
testproblem_uuid_set set&lt;text&gt;,
some_other_non_unique_column text,
PRIMARY KEY (encounter_uuid)
);
If we definitely want fast filtering by encounter_uuid only, it should be defined as the partition key.
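Since the two structures above can't both be named section_ks.encounters, supporting both queries implies two query-specific tables that the application keeps in sync on every write. A minimal sketch of the idea (the table names encounters_by_patient and encounters_by_uuid and the helper are hypothetical; real code would use the driver's prepared statements rather than string formatting):

```python
def encounter_insert_statements(encounter: dict) -> list:
    """Build one INSERT per query-specific copy of an encounter.

    Denormalization: the same record is written to both tables so that
    each query can be served by its own partition key.
    """
    by_patient = (
        "INSERT INTO section_ks.encounters_by_patient "
        "(patient_id, encounter_start_date, encounter_uuid) "
        "VALUES ('{patient_id}', '{encounter_start_date}', '{encounter_uuid}');"
    ).format(**encounter)
    by_uuid = (
        "INSERT INTO section_ks.encounters_by_uuid "
        "(encounter_uuid, patient_id) "
        "VALUES ('{encounter_uuid}', '{patient_id}');"
    ).format(**encounter)
    return [by_patient, by_uuid]

stmts = encounter_insert_statements({
    "patient_id": "1234",
    "encounter_start_date": "2017-08-19",
    "encounter_uuid": "abcd345",
})
```

Both statements would typically be sent together (for example in a logged batch) so the two copies cannot drift apart.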
Good articles about designing an effective data model: