dsbulk unload is missing data

Question

I'm using dsbulk 1.6.0 to unload data from Cassandra 3.11.3.

Each unload produces a wildly different row count. Below are the results of 3 invocations of unload, on the same cluster, connecting to the same Cassandra host. The table being unloaded is append-only and data is never deleted, so the number of unloaded rows should never decrease. There are 3 Cassandra nodes in the cluster and the replication factor is 3, so all data should be present on the chosen host. Furthermore, these runs were executed in quick succession, so the number of rows added between them would be in the hundreds (if any), not in the tens of thousands.

Run 1:

│ total  | failed | rows/s | p50ms     | p99ms     | p999ms
│ 10,937 | 7      | 97     | 15,935.46 | 20,937.97 | 20,937.97
│ Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.

Run 2:

│ total  | failed | rows/s | p50ms     | p99ms     | p999ms
│ 60,558 | 3      | 266    | 12,551.34 | 21,609.05 | 21,609.05
│ Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.

Run 3:

│ total  | failed | rows/s | p50ms     | p99ms     | p999ms
│ 45,404 | 4      | 211    | 16,664.92 | 30,870.08 | 30,870.08
│ Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.

It would appear that Run 1 is missing the majority of the data. Run 2 may be closer to complete, and Run 3 is missing a significant amount of data.

I'm invoking unload as follows:

dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE > $DATA_FILE

I'm assuming this isn't expected behaviour for dsbulk. How do I configure it to reliably unload a complete table without errors?

Answer 1

Data can be missing from a host if that host wasn't reachable when the data was written, the hints weren't replayed, and you don't run repairs periodically. Because DSBulk reads with consistency level LOCAL_ONE by default, different hosts will provide different views of the data. The host that you're providing is just a contact point: after connecting, the cluster topology is discovered and DSBulk selects replicas based on the load balancing policy.

You can force DSBulk to read the data with another consistency level by using the -cl command line option (doc). Compare the results when using LOCAL_QUORUM or ALL. In these modes Cassandra will also "fix" inconsistencies as they are discovered, although this will be much slower and will add load to the nodes because of the repair writes.
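
As a rough sketch of that suggestion, the command from the question can be wrapped so the consistency level is passed explicitly with -cl. The function name unload_with_cl and its argument order are made up for illustration; only the dsbulk options already mentioned in this thread (-h, -k, -t, -cl) are used.

```shell
# Hypothetical helper (not part of dsbulk): unload one table at a given
# consistency level.
# Arguments: contact point, keyspace, table, consistency level, output file.
unload_with_cl() {
  dsbulk unload -h "$1" -k "$2" -t "$3" -cl "$4" > "$5"
}

# Example call, reusing the variables from the question:
# unload_with_cl "$CASSANDRA_IP" "$KEYSPACE" "$CASSANDRA_TABLE" LOCAL_QUORUM "$DATA_FILE"
```

Running it once with LOCAL_QUORUM and once with ALL, then comparing the reported totals, should show whether the differences between runs come from replicas that are out of sync.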

Answer 1: 3 votes, accepted, 2020-10-26 19:06:52