
Issue in full table scan in Cassandra

First: I know a full scan isn't a good idea in Cassandra; however, at the moment, it's what I need.

When I started looking into how to do something like this, I read people saying it wasn't possible to do a full scan in Cassandra and that it wasn't made for this type of thing.

Not satisfied, I kept looking until I found this article: http://www.myhowto.org/bigdata/2013/11/04/scanning-the-entire-cassandra-column-family-with-cql/

It looked pretty reasonable, so I gave it a try. As I will do this full scan only once, and time and performance aren't an issue, I wrote the query and put it in a simple Job to look up all the records I want. Out of 2 billion rows, something like 1000 was my expected output; however, I got only 100 records.

My job:

public void run() {
    Cluster cluster = getConnection();
    Session session = cluster.connect("db");

    LOGGER.info("Starting ...");

    boolean run = true;
    int print = 0;

    while ( run ) {
        if (maxTokenReached(actualToken)) {
            LOGGER.info("Max Token Reached!");
            break;
        }
        ResultSet resultSet = session.execute(queryBuilder(actualToken));

        Iterator<Row> rows = resultSet.iterator();
        if ( !rows.hasNext()){
            break;
        }

        List<String> rowIds = new ArrayList<String>();

        while (rows.hasNext()) {
            Row row = rows.next();

            Long leadTime = row.getLong("my_column");
            if (myCondition(leadTime)) {
                String rowId = row.getString("key");
                rowIds.add(rowId);
            }

            if (!rows.hasNext()) {
                Long token = row.getLong("token(key)");
                if (!rowIds.isEmpty()) {
                    LOGGER.info(String.format("Keys found! RowId's: %s ", rowIds));
                }
                actualToken = nextToken(token);
            }

        }

    }
    LOGGER.info("Done!");
    cluster.shutdown();
}

public boolean maxTokenReached(Long actualToken){
    return actualToken >= maxToken;
}

public String queryBuilder(Long nextRange) {
    return String.format("select token(key), key, my_column from mytable where token(key) >= %s limit 10000;", nextRange.toString());
}

public Long nextToken(Long token){
    return token + 1;
}

Basically, what I do is search for the minimum allowed token and advance incrementally until the last one.

I don't know, but it seems like the job didn't actually do a full scan, or like my query only hit a single node or something. I don't know if I'm doing something wrong, or if it's really not possible to do a full scan.

Today I have almost 2 TB of data, only one table, in one cluster of seven nodes.

Has anyone been in this situation before, or do you have any recommendations?

It's definitely possible to do a full table scan in Cassandra - indeed, it's quite common for things like Spark. However, it's not typically "fast", so it's discouraged unless you know why you're doing it. For your actual questions:

1) If you're using CQL, you're almost certainly using the Murmur3 partitioner, so your minimum token is -9223372036854775808 (and your maximum token is 9223372036854775807).

2) You're using session.execute(), which will use a default consistency level of ONE, which may not return all of the results in your cluster, especially if you're also writing at ONE, which I suspect you may be. Raise that to ALL, and use prepared statements to speed up the CQL parsing:

 public void run() {
     Cluster cluster = getConnection();
     Session session = cluster.connect("db");
     LOGGER.info("Starting ...");
     actualToken = Long.MIN_VALUE; // -9223372036854775808, the minimum Murmur3 token
     boolean run = true;
     int print = 0;

     while ( run ) {
         if (maxTokenReached(actualToken)) {
             LOGGER.info("Max Token Reached!");
             break;
         }
         SimpleStatement stmt = new SimpleStatement(queryBuilder(actualToken));
         stmt.setConsistencyLevel(ConsistencyLevel.ALL);
         ResultSet resultSet = session.execute(stmt);

         Iterator<Row> rows = resultSet.iterator();
         if ( !rows.hasNext()){
             break;
         }

         List<String> rowIds = new ArrayList<String>();

         while (rows.hasNext()) {
             Row row = rows.next();

             Long leadTime = row.getLong("my_column");
             if (myCondition(leadTime)) {
                 String rowId = row.getString("key");
                 rowIds.add(rowId);
             }

             if (!rows.hasNext()) {
                 Long token = row.getLong("token(key)");
                 if (!rowIds.isEmpty()) {
                     LOGGER.info(String.format("Keys found! RowId's: %s ", rowIds));
                 }
                 actualToken = nextToken(token);
             }
         }
      }
     LOGGER.info("Done!");
     cluster.shutdown(); 
  }

public boolean maxTokenReached(Long actualToken){
     return actualToken >= maxToken; 
 }

 public String queryBuilder(Long nextRange) {
     return String.format("select token(key), key, my_column from mytable where token(key) >= %s limit 10000;", nextRange.toString()); 
 }

 public Long nextToken(Long token) {
     return token + 1; 
 }
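
To make the "prepared statements" part concrete, here's a rough sketch of how the paging loop could be rewritten around one. It reuses the helpers above and the DataStax driver's PreparedStatement/BoundStatement classes; treat it as an illustration of the idea, not a drop-in replacement (runPrepared is just an illustrative name):

 public void runPrepared(Session session) {
     // Parse the CQL once; only the token bound is re-bound for each page.
     PreparedStatement prepared = session.prepare(
             "select token(key), key, my_column from mytable where token(key) >= ? limit 10000;");

     long actualToken = Long.MIN_VALUE;
     while (!maxTokenReached(actualToken)) {
         BoundStatement bound = prepared.bind(actualToken);
         bound.setConsistencyLevel(ConsistencyLevel.ALL); // read from all replicas, as above
         ResultSet resultSet = session.execute(bound);
         if (resultSet.isExhausted()) {
             break; // nothing left at or beyond this token
         }

         long lastToken = actualToken;
         for (Row row : resultSet) {
             lastToken = row.getLong("token(key)");
             // ... apply myCondition(row.getLong("my_column")) and collect row.getString("key") ...
         }
         actualToken = lastToken + 1; // same idea as nextToken() above
     }
 }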

I'd highly recommend using Spark - even in a standalone application (i.e. without a cluster). It'll take care of chunking up the token range into partitions and processing them one by one. It's dead easy to use too:

https://github.com/datastax/spark-cassandra-connector
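
For what it's worth, a minimal standalone sketch with the connector's Java API. Assumptions on my part: a 1.x connector, the same db.mytable schema as above, and a placeholder contact point and filter condition, since I don't know your environment:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class FullScan {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("cassandra-full-scan")
                .setMaster("local[*]")                                // standalone mode, no Spark cluster needed
                .set("spark.cassandra.connection.host", "127.0.0.1"); // placeholder contact point

        JavaSparkContext sc = new JavaSparkContext(conf);

        // The connector splits the token ring into Spark partitions and scans them one by one.
        List<String> keys = javaFunctions(sc)
                .cassandraTable("db", "mytable")
                .filter(row -> row.getLong("my_column") > 0L)         // stand-in for myCondition(...)
                .map(row -> row.getString("key"))
                .collect();

        System.out.println("Matching keys: " + keys.size());
        sc.stop();
    }
}

Run it with spark-submit (or straight from main with Spark and the connector on the classpath) and you get the full scan without any manual token bookkeeping.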

Is this something you need to do regularly? Or a one-off? I agree this is not something you want to do on a regular basis, but I also had an issue where I had to read through all rows of a column family, and I relied on the AllRowsReader recipe from the Astyanax client. I see you are using the DataStax CQL driver to connect to your cluster, but if what you're looking for is something proven to work, you might not mind dealing with the problem using the Astyanax library.

In my case I used it to read all the row keys, and then I had another job that interacted with the column family using the keys I had collected.

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.recipes.reader.AllRowsReader;

import java.util.concurrent.CopyOnWriteArrayList;

...        

private final Keyspace keyspace;
private final ColumnFamily<String, byte[]> columnFamily;

public List<String> getAllKeys() throws Exception {

    final List<String> rowKeys = new CopyOnWriteArrayList<>();

    new AllRowsReader.Builder<>(keyspace, columnFamily).withColumnRange(null, null, false, 0)
        .withPartitioner(null) // this will use keyspace's partitioner
        .withConsistencyLevel(ConsistencyLevel.CL_ONE).forEachRow(row -> {
        if (row == null) {
            return true;
        }

        String key = row.getKey();

        rowKeys.add(key);

        return true;
    }).build().call();

    return rowKeys;
}

There are different configuration options to run this in several threads, among other things. As I said, I only ran this once in my code and it worked really well. I'd be happy to help if you run into issues trying to make it work.
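
For example, a rough sketch of the options I mean (my assumption: your Astyanax version exposes withPageSize and withConcurrencyLevel on AllRowsReader.Builder, as the one I used did):

// Same builder as above, but paging explicitly and reading token ranges in parallel.
new AllRowsReader.Builder<>(keyspace, columnFamily)
    .withColumnRange(null, null, false, 0)          // keys only, skip column data
    .withPartitioner(null)                          // use the keyspace's partitioner
    .withConsistencyLevel(ConsistencyLevel.CL_ONE)
    .withPageSize(1000)                             // rows fetched per request (assumed option)
    .withConcurrencyLevel(8)                        // token ranges read in parallel (assumed option)
    .forEachRow(row -> {
        if (row != null) {
            rowKeys.add(row.getKey());              // same CopyOnWriteArrayList as in getAllKeys()
        }
        return true;                                // keep iterating
    })
    .build()
    .call();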

Hope this helps,

José Luis

If you regularly need to do full table scans of a Cassandra table, say for analytics in Spark, then I highly suggest you consider storing your data using a data model that is read-optimized. You can check out http://github.com/tuplejump/FiloDB for an example of a read-optimized setup on Cassandra.
