
Map a table of a Cassandra database using Spark and RDD

I have to map a table that stores the usage history of an app. The table has these tuples:

<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>

AppId is always different, because it refers to many different apps. The date is expressed in the format dd/mm/yyyy hh/mm, and cpuUsage and memoryUsage are expressed in %, so for example:

<3ghffh3t482age20304,23/07/2014 22:45,0.2,3.5>

I retrieved the data from Cassandra this way (small snippet):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        session.execute("CREATE KEYSPACE IF NOT EXISTS foo WITH replication "
                + "= {'class':'SimpleStrategy', 'replication_factor':3};");
        // Note the closing parenthesis after the PRIMARY KEY clause, and that the
        // clustering column must be "date" (the schema has no "time" column)
        String createTableAppUsage = "CREATE TABLE IF NOT EXISTS foo.appusage"
                + "(appid text, date text, cpuusage double, memoryusage double, "
                + "PRIMARY KEY(appid, date)) WITH CLUSTERING ORDER BY (date ASC);";
        session.execute(createTableAppUsage);
        // Use select to get the appusage table's rows
        ResultSet resultForAppUsage = session.execute("SELECT appid,cpuusage FROM foo.appusage");
        for (Row row : resultForAppUsage)
            // cpuusage is a double, so read it with getDouble, not getString
            System.out.println("appid: " + row.getString("appid") + " cpuusage: " + row.getDouble("cpuusage"));
        // Clean up the connection by closing it
        cluster.close();
    }

So, my problem now is to map the data by key/value and create a tuple, integrating this code (snippet that doesn't work):

        <AppId,cpuusage>

        JavaPairRDD<String, Integer> saveTupleKeyValue =
                someStructureFromTakeData.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String x) {
                return new Tuple2(x, y);
            }
        });

How can I map appId and cpuusage using an RDD, and then reduce over the result, e.g. keeping only cpuusage > 50?

Any help?

Thanks in advance.

Assuming that you have a valid SparkContext sparkContext already created, have added the spark-cassandra-connector dependencies to your project, and have configured your Spark application to talk to your Cassandra cluster (see the docs for that), then we can load the data into an RDD like this:

val data = sparkContext.cassandraTable("foo", "appusage").select("appid", "cpuusage")

In Java, the idea is the same, but it requires a bit more plumbing, as described here.
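For illustration, here is a minimal sketch of that Java plumbing, assuming the spark-cassandra-connector Java API (CassandraJavaUtil) is on the classpath and Cassandra is reachable at 127.0.0.1; the class name and the `pairs`/`heavy` variable names are hypothetical, not from the original question:

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AppUsageJob {
    public static void main(String[] args) {
        // Hypothetical setup: point the connector at the local Cassandra node
        SparkConf conf = new SparkConf()
                .setAppName("appusage")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load only the two columns we need from foo.appusage and
        // pair each row as <appid, cpuusage>
        JavaPairRDD<String, Double> pairs = javaFunctions(sc)
                .cassandraTable("foo", "appusage")
                .select("appid", "cpuusage")
                .mapToPair((CassandraRow row) ->
                        new Tuple2<>(row.getString("appid"), row.getDouble("cpuusage")));

        // Keep only the entries whose cpu usage exceeds 50
        JavaPairRDD<String, Double> heavy = pairs.filter(t -> t._2() > 50);

        heavy.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}
```

The `filter` step stands in for the "reduce e.g. cpuusage > 50" part of the question; if an actual per-key aggregation is needed instead, `reduceByKey` on the same pair RDD would be the natural next step.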

