spark-cassandra-connector - repartitionByCassandraReplica returns empty RDD - Java
I have a 16-node cluster where every node has both Spark and Cassandra installed, and I am using Spark-Cassandra Connector 3.0.0. I am trying to join a dataset with a Cassandra table on the partition key, while also using .repartitionByCassandraReplica.
However, I just get an empty RDD with 0 partitions (line 5 below). Any ideas why?
Encoder<ExperimentForm> ExpEncoder = Encoders.bean(ExperimentForm.class);
//FYI experimentlist is a List<String>
Dataset<ExperimentForm> dfexplistoriginal = sp.createDataset(experimentlist, Encoders.STRING()).toDF("experimentid").as(ExpEncoder);
JavaRDD<ExperimentForm> predf = CassandraJavaUtil.javaFunctions(dfexplistoriginal.toJavaRDD()).repartitionByCassandraReplica("mdb","experiment",experimentlist.size(),CassandraJavaUtil.someColumns("experimentid"),CassandraJavaUtil.mapToRow(ExperimentForm.class));
System.out.println(predf.collect()); //Here it gives an empty dataset with 0 partitions
Dataset<ExperimentForm> newdfexplist = sp.createDataset(predf.rdd(), ExpEncoder);
Dataset<Row> readydfexplist = newdfexplist.as(Encoders.STRING()).toDF("experimentid");
Dataset<Row> metlistinitial = sp.read().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "mdb");
            put("table", "experiment");
        }
    })
    .load().select(col("experimentid"), col("description"), col("intensity")).join(readydfexplist, "experimentid");
In case it is needed, this is the experiment table in Cassandra:
CREATE TABLE experiment (
    experimentid varchar,
    description text,
    rt float,
    intensity float,
    mz float,
    identifier text,
    chemical_formula text,
    filename text,
    PRIMARY KEY ((experimentid), description, rt, intensity, mz, identifier, chemical_formula, filename));
And this is the ExperimentForm class:
public class ExperimentForm {
    private String experimentid;

    public String getExperimentid() {
        return experimentid;
    }

    public void setExperimentid(String experimentid) {
        this.experimentid = experimentid;
    }
}
Let me know if you need any additional information.
The answer is basically the same as for the question "Spark-Cassandra: repartitionByCassandraReplica or converting dataset to JavaRDD and back do not maintain number of partitions?"
I just had to do the repartitionByCassandraReplica and joinWithCassandraTable on the RDD and then convert back to a dataset.
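For anyone who wants to see roughly what that looks like, here is a minimal sketch of the RDD-based approach, not the exact code from the linked answer. ExperimentKey and ExperimentRow are hypothetical beans introduced only for this example (ExperimentKey stands in for ExperimentForm, made Serializable), and partitionsPerHost is an assumed tuning value. It needs Spark and the connector on the classpath plus a reachable cluster.

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;

import com.datastax.spark.connector.japi.rdd.CassandraJavaPairRDD;
import java.io.Serializable;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class ReplicaJoinSketch {

    // Hypothetical stand-in for ExperimentForm: holds only the partition key.
    public static class ExperimentKey implements Serializable {
        private String experimentid;
        public String getExperimentid() { return experimentid; }
        public void setExperimentid(String experimentid) { this.experimentid = experimentid; }
    }

    // Hypothetical bean for the columns read back from mdb.experiment.
    public static class ExperimentRow implements Serializable {
        private String experimentid;
        private String description;
        private Float intensity;
        public String getExperimentid() { return experimentid; }
        public void setExperimentid(String experimentid) { this.experimentid = experimentid; }
        public String getDescription() { return description; }
        public void setDescription(String description) { this.description = description; }
        public Float getIntensity() { return intensity; }
        public void setIntensity(Float intensity) { this.intensity = intensity; }
    }

    public static Dataset<ExperimentRow> joinOnRdd(
            SparkSession sp, List<String> experimentlist, int partitionsPerHost) {
        JavaSparkContext jsc = new JavaSparkContext(sp.sparkContext());

        // Build the key RDD directly instead of going Dataset -> JavaRDD.
        JavaRDD<ExperimentKey> keys = jsc.parallelize(experimentlist).map(id -> {
            ExperimentKey key = new ExperimentKey();
            key.setExperimentid(id);
            return key;
        });

        // Move each key to a Spark node that is a replica for it.
        // The third argument is the number of partitions per host.
        JavaRDD<ExperimentKey> byReplica = javaFunctions(keys).repartitionByCassandraReplica(
                "mdb", "experiment", partitionsPerHost,
                someColumns("experimentid"), mapToRow(ExperimentKey.class));

        // Join on the partition key while the data is still an RDD.
        CassandraJavaPairRDD<ExperimentKey, ExperimentRow> joined =
                javaFunctions(byReplica).joinWithCassandraTable(
                        "mdb", "experiment",
                        someColumns("experimentid", "description", "intensity"),
                        someColumns("experimentid"),
                        mapRowTo(ExperimentRow.class), mapToRow(ExperimentKey.class));

        // Only now convert the joined rows back to a Dataset.
        return sp.createDataset(joined.values().rdd(), Encoders.bean(ExperimentRow.class));
    }
}
```

Calling joinOnRdd(sp, experimentlist, 10) would return the joined rows as a Dataset; the value 10 is only a placeholder. The point of the ordering is that the repartitioning and the join both happen before any conversion back to a Dataset.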
I am having the same problem. Did you manage to solve this? Anyone?