
Slow query performance when querying Cassandra with Apache Spark in Java

I'm still new to NoSQL solutions and only started learning about them a few months ago.

I have a project built with the Spring Boot framework that has a DAO layer. My database is Cassandra, and I'm using the DataStax Java driver to communicate with it. I found that Cassandra (and perhaps all NoSQL key/value stores) does not support case-insensitive matching or LIKE '%...%' style queries. After some research on Stack Overflow and other forums, I figured out that you have to use a tool such as Apache Spark, Elasticsearch, or Apache Lucene to dig through the data in Cassandra. I chose Apache Spark, but I'm not sure whether my code is done the right way (in terms of best practice).

Here's my code to query data:

@Override
public Login getLoginByEmail(String shopId, String email) throws InterruptedException, ExecutionException {

    // Read the whole app_login table as an RDD, then filter it on the Spark side
    // (javaFunctions is the static helper from com.datastax.spark.connector.japi.CassandraJavaUtil)
    JavaFutureAction<List<Login>> loginRDDFuture = javaFunctions(getSparkContext())
            .cassandraTable("shop_abc", "app_login", loginRowReader)
            .filter(new Function<Login, Boolean>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Boolean call(Login login) throws Exception {
                    // case-insensitive match on the e-mail address
                    return login.getEmail().equalsIgnoreCase(email.trim());
                }
            }).collectAsync();

    List<Login> lgnList = loginRDDFuture.get();

    if (lgnList.size() > 0) {
        return lgnList.get(0);
    }

    return null;
}

It took 9 seconds to get the result, and the database has only one table with 3 records. I wonder what would happen if the database had more than a million records.

I'm not sure whether this is good practice, or whether there is a better way or better tools to do it. I hope someone can give me some guidance.

Any help is appreciated.

I think this kind of query will be rather slow because it has to retrieve all the data from your C* database, breaking the read up by token ranges, mapping the results into RDDs, and then filtering through them in a Spark job. That carries some overhead even when your data set is small. That said, 9 seconds does seem like quite a while; it's hard to say why without knowing more about your environment.

Alternatively, have you considered using SSTable Attached Secondary Indexes (SASI)? SASI was introduced in C* 3.4 and allows you to do LIKE '%...%' queries in Cassandra, with or without case sensitivity, e.g.:

CREATE CUSTOM INDEX fn_suffix_allcase ON cyclist_name (firstname) 
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 
  'mode': 'CONTAINS',
  'analyzer_class':'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
  'case_sensitive': 'false'
};
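With such an index in place, a point lookup like getLoginByEmail no longer needs Spark at all; the regular DataStax Java driver can run the LIKE query directly against the index. Below is a minimal sketch (driver 3.x), reusing the keyspace/table names from your code; the email column name and the Login setter are assumptions, so adjust them to your actual schema and POJO:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public Login getLoginByEmail(Session session, String email) {
    // The SASI index serves the LIKE predicate; with case_sensitive = 'false'
    // the analyzer lower-cases both the indexed values and the search term,
    // so the lookup is effectively case-insensitive.
    ResultSet rs = session.execute(
            "SELECT * FROM shop_abc.app_login WHERE email LIKE ? LIMIT 1",
            email.trim());

    Row row = rs.one();
    if (row == null) {
        return null;
    }

    Login login = new Login();
    login.setEmail(row.getString("email"));   // assumed setter; map the remaining columns as needed
    return login;
}

Binding "%" + email.trim() + "%" instead would give you a contains-style search, which is what the CONTAINS mode above is really meant for.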

A good reference talk on SASI is "SASI: Cassandra on the Full Text Search Ride".
