Citus: How to promote a metadata-synced worker to coordinator if the coordinator is completely lost

I have a Citus cluster with 1 coordinator and 3 workers.

Recently, the server that hosted the coordinator went down completely and cannot be recovered.

Now the cluster can only run queries and some DML, but not DDL, which can only be executed on the coordinator!

So, how could I promote a metadata-synced worker to coordinator?

Firstly, we currently have no accepted solution for such scenarios. Typically, this would have to be resolved by restoring the coordinator from a backup. It is highly recommended to set up your database to take periodic backups and to be able to restore them in case of such unrecoverable crashes.

This solution does not promote a worker but establishes a new coordinator. Complications might arise (your database might break) if you apply this to your own cluster, because it involves modifying Citus' metadata by hand. This solution is experimental, so I highly recommend that you first apply it to a fork of your cluster, or snapshot your disks, to ensure no damage is done to your data.

  • This solution has been tested in a limited scope for Citus 11, in a local cluster containing a distributed table and a reference table. If you have other distributed objects in your database, like views or collations, this solution may fail (I haven't tested it).

  • Throughout this, you will need to look up group IDs and node IDs with SELECT * FROM pg_dist_node;

  • Some steps need to be included or excluded depending on whether the old coordinator is in the pg_dist_node table. I will mark steps with [if CIM include] when they must be run only if the old coordinator is in pg_dist_node, and with [if CIM exclude] when they must be skipped if the old coordinator is in pg_dist_node. (CIM = coordinator in metadata)
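
As a rough way to decide which of the [if CIM ...] steps apply, a check like the one below could be run on any surviving worker with synced metadata, before anything is deleted. It is only a sketch, assuming (as the steps in this answer do) that the coordinator always has groupid 0 in pg_dist_node; the column alias coordinator_in_metadata is an illustrative name, not something Citus defines.

```sql
-- true  => old coordinator is in the metadata: run [if CIM include] steps, skip [if CIM exclude] steps
-- false => old coordinator is not in the metadata: skip [if CIM include] steps, run [if CIM exclude] steps
SELECT EXISTS (
    SELECT 1 FROM pg_dist_node WHERE groupid = 0
) AS coordinator_in_metadata;

-- list the nodes with the columns that the later steps refer to
SELECT nodeid, groupid, nodename, nodeport
FROM pg_dist_node
ORDER BY groupid;
```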

  • Create a new node and install Citus.

  • [if CIM include] Run on all your nodes except the new one you just created:

-- removes old coordinator from citus metadata
DELETE FROM pg_dist_node WHERE groupid = 0;

-- removes old coordinator placements from citus metadata
DELETE FROM pg_dist_placement WHERE groupid = 0;
  • Connect to your node in group 1 (look the node up in pg_dist_node):
-- temporarily mark this node as the coordinator in the metadata
UPDATE pg_dist_local_group SET groupid = 0;
UPDATE pg_dist_node SET groupid = 0 WHERE groupid = 1;
UPDATE pg_dist_placement SET groupid = 0 WHERE groupid = 1;

-- adjust the metadata as if we are ready to add new node
SELECT max(groupid) as groupid FROM pg_dist_node \gset
SELECT setval('pg_dist_groupid_seq', :groupid, true);
SELECT max(nodeid) as nodeid FROM pg_dist_node \gset
SELECT setval('pg_dist_node_nodeid_seq', :nodeid, true);
SELECT max(placementid) as placementid FROM pg_dist_placement \gset
SELECT setval('pg_dist_placement_placementid_seq', :placementid, true);

-- add the new node
SELECT citus_add_node('NEW_NODE_HOST', NEW_NODE_PORT);

-- set back the original metadata
UPDATE pg_dist_local_group SET groupid = 1;
UPDATE pg_dist_node SET groupid = 1 WHERE groupid = 0;
UPDATE pg_dist_placement SET groupid = 1 WHERE groupid = 0;

-- look up your new node's group id
-- this value is important and I will refer to it in future steps as NEW_NODE_GROUP_ID
SELECT * FROM pg_dist_node;
  • Connect to your new node and run:
-- restore the metadata of your temporary coordinator
UPDATE pg_dist_node SET groupid = 1 WHERE groupid = 0;
UPDATE pg_dist_placement SET groupid = 1 WHERE groupid = 0;

-- set the new node as the coordinator
UPDATE pg_dist_local_group SET groupid = 0;

-- set the metadata for the new coordinator
SELECT max(groupid) as groupid FROM pg_dist_node \gset
SELECT setval('pg_dist_groupid_seq', :groupid, true);
SELECT max(nodeid) as nodeid FROM pg_dist_node \gset
SELECT setval('pg_dist_node_nodeid_seq', :nodeid, true);
SELECT max(placementid) as placementid FROM pg_dist_placement \gset
SELECT setval('pg_dist_placement_placementid_seq', :placementid, true);
SELECT max(shardid) as shardid FROM pg_dist_shard \gset
SELECT setval('pg_dist_shardid_seq', :shardid, true);

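  • [if CIM exclude] Run on the new node:
-- if the coordinator is not in the metadata, all shards on the new node (the future coordinator)
-- need to be dropped
-- note: quote_ident() quotes the whole name as one identifier, so this assumes the
-- shard tables are visible through the current search_path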
SET citus.enable_manual_changes_to_shards TO true;
DO $$
DECLARE
    row record;
BEGIN
    FOR row IN 
    SELECT CONCAT(logicalrelid, '_', shardid) AS shard_name 
    FROM pg_dist_placement NATURAL JOIN pg_dist_shard 
    WHERE groupid = NEW_NODE_GROUP_ID
    LOOP
        EXECUTE 'DROP TABLE ' || quote_ident(row.shard_name);
        RAISE INFO 'Dropped shard: %', quote_ident(row.shard_name);
    END LOOP;
END;
$$;
RESET citus.enable_manual_changes_to_shards;
  • Run on all of the nodes:
-- if coordinator is not in the metadata remove the new node from the pg_dist_node
-- and remove its shard placements from the metadata
DELETE FROM pg_dist_node WHERE groupid = NEW_NODE_GROUP_ID; -- [if CIM exclude]
DELETE FROM pg_dist_placement WHERE groupid = NEW_NODE_GROUP_ID; -- [if CIM exclude]

-- if the coordinator is in the metadata, set the new node as coordinator, set shouldhaveshards
-- for the coordinator to false, and update the coordinator placement group ids
UPDATE pg_dist_node SET groupid = 0, shouldhaveshards = False WHERE groupid = NEW_NODE_GROUP_ID; -- [if CIM include]
UPDATE pg_dist_placement SET groupid = 0 WHERE groupid = NEW_NODE_GROUP_ID; -- [if CIM include]

This should be it; your new node is now the coordinator. Good luck :)
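
To sanity-check the result, something like the following could be run on the new coordinator. This is a minimal sketch under the same assumptions as the steps above (Citus 11, coordinator identified by groupid 0); it only reads the metadata tables already used in this answer and has not been tested beyond that scope.

```sql
-- the new node should now identify itself as group 0
SELECT groupid FROM pg_dist_local_group;

-- node list as seen by the new coordinator
SELECT nodeid, groupid, nodename, nodeport, isactive, shouldhaveshards
FROM pg_dist_node
ORDER BY groupid;

-- placements whose group has no matching node; ideally this returns no rows
SELECT p.groupid, count(*) AS orphaned_placements
FROM pg_dist_placement p
LEFT JOIN pg_dist_node n ON n.groupid = p.groupid
WHERE n.nodeid IS NULL
GROUP BY p.groupid;
```

If the last query returns rows, some placements still point at a group that was removed or renumbered, and the metadata on that node should be reviewed before running any DDL.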
