
Cassandra: how to move whole tables to another keyspace

Version info of my Cassandra:

[cqlsh 5.0.1 | Cassandra 2.2.5 | CQL spec 3.3.1 | Native protocol v4]

I am trying to move some huge tables (several million rows) to another keyspace. Besides "COPY to csv, and COPY from csv", is there any better solution?

Ok, I managed to get this to work on a single-node cluster running 2.2.8.

I experimented by moving my holidays table from my presentation keyspace over to my stackoverflow keyspace.

Here are the steps I took:

Create the table inside the new keyspace.

This step is important, because each table has a UUID for a unique identifier, stored in the cf_id column of the system.schema_columnfamilies table. This id is attached to the directory names that hold the data. By copy/pasting the schema from one keyspace to another, you'll ensure that the same column names are used, but that a new unique identifier is generated.

Note: In 3.x, the identifier is stored in the system_schema.tables table. A quick way to check the ids is sketched after the table definition below.

aploetz@cqlsh:stackoverflow> CREATE TABLE holidays (
 type text,
 eventtime timestamp,
 beginend text,
 name text,
 PRIMARY KEY (type, eventtime, beginend)
) WITH CLUSTERING ORDER BY (eventtime DESC, beginend DESC);

aploetz@cqlsh:stackoverflow> SELECT * FROM stackoverflow.holidays ;

 type | eventtime | beginend | name
------+-----------+----------+------

(0 rows)
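If you want to confirm that the copy/pasted schema really did get its own identifier, you can compare the two cf_id values before moving anything. This is just a sketch against the 2.2-era system tables, using the keyspace and table names from this example:

aploetz@cqlsh> SELECT keyspace_name, columnfamily_name, cf_id
           FROM system.schema_columnfamilies
           WHERE keyspace_name IN ('presentation', 'stackoverflow')
             AND columnfamily_name = 'holidays';

The two rows returned should show different cf_id values, and each value should match the UUID suffix on that table's directory under the data path.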

Stop your node(s) properly (DISABLEGOSSIP, DRAIN, kill/stop, etc...).
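As a sketch, that shutdown sequence on a package install might look like this (the service name is an assumption; adjust for your environment):

$ nodetool disablegossip        # mark this node down to the rest of the cluster
$ nodetool drain                # flush memtables to disk and stop accepting writes
$ sudo service cassandra stop   # service name assumed; stop the process however you run it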

Now, find the location of the old and new table on disk (a sketch for locating the directories follows the listing below), and copy/move the files from the old location to the new one:

$ ls -al /var/lib/cassandra/data22/stackoverflow/holidays-77a767e0a5f111e6a2bebd9d201c4c8f/
total 12
drwxrwxr-x  3 aploetz aploetz 4096 Nov  8 14:25 .
drwxrwxr-x 17 aploetz aploetz 4096 Nov  8 14:25 ..
drwxrwxr-x  2 aploetz aploetz 4096 Nov  8 14:25 backups
$ cp /var/lib/cassandra/data22/presentation/holidays-74bcfde0139011e6a67c2575e6398503/la* /var/lib/cassandra/data22/stackoverflow/holidays-77a767e0a5f111e6a2bebd9d201c4c8f/

$ ls -al /var/lib/cassandra/data22/stackoverflow/holidays-77a767e0a5f111e6a2bebd9d201c4c8f/
drwxrwxr-x  3 aploetz aploetz 4096 Nov  8 14:26 .
drwxrwxr-x 17 aploetz aploetz 4096 Nov  8 14:25 ..
drwxrwxr-x  2 aploetz aploetz 4096 Nov  8 14:25 backups
-rw-rw-r--  1 aploetz aploetz   43 Nov  8 14:26 la-1-big-CompressionInfo.db
-rw-rw-r--  1 aploetz aploetz  628 Nov  8 14:26 la-1-big-Data.db
-rw-rw-r--  1 aploetz aploetz    9 Nov  8 14:26 la-1-big-Digest.adler32
-rw-rw-r--  1 aploetz aploetz   16 Nov  8 14:26 la-1-big-Filter.db
-rw-rw-r--  1 aploetz aploetz   57 Nov  8 14:26 la-1-big-Index.db
-rw-rw-r--  1 aploetz aploetz 4468 Nov  8 14:26 la-1-big-Statistics.db
-rw-rw-r--  1 aploetz aploetz   94 Nov  8 14:26 la-1-big-Summary.db
-rw-rw-r--  1 aploetz aploetz   94 Nov  8 14:26 la-1-big-TOC.txt
-rw-rw-r--  1 aploetz aploetz   43 Nov  8 14:26 la-2-big-CompressionInfo.db
-rw-rw-r--  1 aploetz aploetz  164 Nov  8 14:26 la-2-big-Data.db
-rw-rw-r--  1 aploetz aploetz   10 Nov  8 14:26 la-2-big-Digest.adler32
-rw-rw-r--  1 aploetz aploetz   16 Nov  8 14:26 la-2-big-Filter.db
-rw-rw-r--  1 aploetz aploetz   26 Nov  8 14:26 la-2-big-Index.db
-rw-rw-r--  1 aploetz aploetz 4460 Nov  8 14:26 la-2-big-Statistics.db
-rw-rw-r--  1 aploetz aploetz  108 Nov  8 14:26 la-2-big-Summary.db
-rw-rw-r--  1 aploetz aploetz   94 Nov  8 14:26 la-2-big-TOC.txt
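If you don't already know where these directories live, the data_file_directories setting in cassandra.yaml names the parent path, and a find picks out the per-table directories. The config path here is an assumption based on a typical package install:

$ grep -A 1 'data_file_directories' /etc/cassandra/cassandra.yaml   # config path assumed
$ find /var/lib/cassandra/data22 -maxdepth 2 -type d -name 'holidays-*'   # both tables' dirs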

Restart your node(s).
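Again assuming a package install, bringing the node back and waiting for it to report Up/Normal might look like:

$ sudo service cassandra start   # service name assumed
$ nodetool status                # wait for this node to show "UN" (Up/Normal)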

Query via cqlsh:

Connected to SnakesAndArrows at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 2.2.8 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
aploetz@cqlsh> SELECT * FROM stackoverflow.holidays ;

 type         | eventtime                | beginend | name
--------------+--------------------------+----------+------------------------
    Religious | 2016-12-26 05:59:59+0000 |        E |              Christmas
    Religious | 2016-12-25 06:00:00+0000 |        B |              Christmas
    Religious | 2016-03-28 04:59:59+0000 |        E |                 Easter
    Religious | 2016-03-27 05:00:00+0000 |        B |                 Easter
 presentation | 2016-05-06 20:40:08+0000 |        B |        my presentation
 presentation | 2016-05-06 20:40:03+0000 |        B |        my presentation
 presentation | 2016-05-06 20:39:15+0000 |        B |        my presentation
 presentation | 2016-05-06 20:38:10+0000 |        B |        my presentation
           US | 2016-07-05 04:59:59+0000 |        E |            4th of July
           US | 2016-07-04 05:00:00+0000 |        B |            4th of July
           US | 2016-05-09 04:59:59+0000 |        E |            Mothers Day
           US | 2016-05-08 05:00:00+0000 |        B |            Mothers Day
         Nerd | 2016-12-22 05:59:59+0000 |        E |               2112 Day
         Nerd | 2016-12-21 06:00:00+0000 |        B |               2112 Day
         Nerd | 2016-09-26 04:59:59+0000 |        E |             Hobbit Day
         Nerd | 2016-09-25 05:00:00+0000 |        B |             Hobbit Day
         Nerd | 2016-09-20 04:59:59+0000 |        E | Talk Like a Pirate Day
         Nerd | 2016-09-19 05:00:00+0000 |        B | Talk Like a Pirate Day
         Nerd | 2016-05-07 04:59:59+0000 |        E |         Star Wars Week
         Nerd | 2016-05-04 05:00:00+0000 |        B |         Star Wars Week
         Nerd | 2016-03-14 05:00:00+0000 |        E |                 Pi Day
         Nerd | 2016-03-14 05:00:00+0000 |        B |                 Pi Day

(22 rows)

The problem with this approach is that you need to stop the cluster and move files around on each node, whereas cqlsh COPY allows you to import and export on a single node with the cluster still running.

And I know that COPY has a reputation that limits it to smaller datasets. But 2.2.x has options that help throttle COPY to keep it from timing out on large datasets. I recently used it to export/import 370 million rows without a timeout.
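For reference, a throttled export/import with the 2.2-era cqlsh COPY options might look something like this; the values are illustrative, not tuned recommendations:

aploetz@cqlsh> COPY presentation.holidays TO '/tmp/holidays.csv'
           WITH PAGESIZE=1000 AND PAGETIMEOUT=20;
aploetz@cqlsh> COPY stackoverflow.holidays FROM '/tmp/holidays.csv'
           WITH INGESTRATE=50000 AND NUMPROCESSES=4 AND MAXBATCHSIZE=10;

PAGESIZE and PAGETIMEOUT pace the export so individual pages don't time out, while INGESTRATE caps rows per second on import.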
