Map a table of a cassandra database using spark and RDD
I have to map a table in which the usage history of an app is written. The table has these tuples:
<AppId,date,cpuUsage,memoryUsage>
AppId is always different, because it refers to many different apps; date is expressed in the format dd/mm/yyyy hh/mm; cpuUsage and memoryUsage are expressed in %. So, for example:
<3ghffh3t482age20304,230720142245,0.2,3.5>
I retrieved the data from Cassandra this way (small snippet):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect();
    session.execute("CREATE KEYSPACE IF NOT EXISTS foo WITH replication "
        + "= {'class':'SimpleStrategy', 'replication_factor':3};");
    // Note: the PRIMARY KEY clause needs a closing parenthesis before WITH,
    // and the clustering column must be an actual column of the table (date).
    String createTableAppUsage = "CREATE TABLE IF NOT EXISTS foo.appusage "
        + "(appid text, date text, cpuusage double, memoryusage double, "
        + "PRIMARY KEY(appid, date)) "
        + "WITH CLUSTERING ORDER BY (date ASC);";
    session.execute(createTableAppUsage);
    // Use select to get the appusage table's rows
    ResultSet resultForAppUsage = session.execute("SELECT appid, cpuusage FROM foo.appusage");
    for (Row row : resultForAppUsage)
        // cpuusage is a double, so read it with getDouble, not getString
        System.out.println("appid: " + row.getString("appid") + " cpuusage: " + row.getDouble("cpuusage"));
    // Clean up the connection by closing it
    cluster.close();
}
So, my problem now is to map the data by key/value and create a <AppId,cpuusage> tuple, integrating this code (a snippet that doesn't work):
JavaPairRDD<String, Integer> saveTupleKeyValue =
    someStructureFromTakeData.mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String x) {
            return new Tuple2<String, Integer>(x, y); // y is not defined here
        }
    });
How can I map appId and cpuusage using an RDD, and then reduce/filter, e.g. keeping only cpuusage > 50?
Any help? Thanks in advance.
Assuming that you have a valid SparkContext sparkContext already created, have added the spark-cassandra connector dependencies to your project, and have configured your Spark application to talk to your Cassandra cluster (see the docs for that), then we can load the data into an RDD like this:
val data = sparkContext.cassandraTable("foo", "appusage").select("appid", "cpuusage")
In Java, the idea is the same, but it requires a bit more plumbing, described here.
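As a rough sketch of that Java plumbing, the load plus the map/filter asked about in the question could look like the following, using the spark-cassandra-connector Java API (CassandraJavaUtil). The class name AppUsageJob, the local master, and the contact point 127.0.0.1 are illustrative assumptions, not part of the original answer:

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AppUsageJob {
    public static void main(String[] args) {
        // Assumed configuration: local mode, Cassandra on localhost
        SparkConf conf = new SparkConf()
                .setAppName("appusage-job")
                .setMaster("local[*]")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load (appid, cpuusage) pairs from foo.appusage
        JavaPairRDD<String, Double> usageByApp = javaFunctions(sc)
                .cassandraTable("foo", "appusage")
                .select("appid", "cpuusage")
                .mapToPair(row -> new Tuple2<>(row.getString("appid"),
                                               row.getDouble("cpuusage")));

        // Keep only the entries whose cpuusage exceeds 50
        JavaPairRDD<String, Double> heavyUsage =
                usageByApp.filter(pair -> pair._2() > 50);

        heavyUsage.collect().forEach(t ->
                System.out.println(t._1() + " -> " + t._2()));

        sc.stop();
    }
}
```

The filter runs on the pair RDD after the mapToPair, so the predicate sees each <appid, cpuusage> tuple; a reduceByKey could be chained the same way if an aggregate per app is needed instead.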