
Scala: get data from Scylla using Spark

Scala/Spark newbie here. I have inherited some old code that I refactored and have been trying to use to retrieve data from Scylla. The code looks like this:

val TEST_QUERY = s"SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type';"

var selectData = List[Row]()
dataRdd.foreachPartition {
  iter => {
    // Build up a cluster that we can connect to
    // Start a session with the cluster by connecting to it.
    val cluster = ScyllaConnector.getCluster(clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
    var batchCounter = 0

    val session = cluster.connect(tableConfig.keySpace)
    val preparedStatement: PreparedStatement = session.prepare(TEST_QUERY)

    iter.foreach {
      case (test_name: String) => {
        // Get results
        val testResults = session.execute(preparedStatement.bind(test_name))
        if (testResults != null){
          val testResult = testResults.one()
          if(testResult != null){
            val user_id = testResult.getString("user_id")
            selectData ::= Row(user_id, test_name)
          }
        }
      }
    }
    session.close()
    cluster.close()
  }
}

println("Head is =======> ")
println(selectData.head)

The above does not return any data and fails with a NullPointerException, because the `selectData` list is empty, although there is definitely data that matches the select statement. I feel like the way I'm doing this is not correct, but I can't figure out what needs to change to fix it, so any help is much appreciated.

PS: The whole reason I'm using a list to hold the results is so that I can use it afterwards to create a DataFrame. I'd be grateful if you could point me in the right direction here.

If you look at the definition of the foreachPartition function, you will see that by definition it can't return anything, because its return type is Unit. Moreover, the closure runs on the executors, so appending to the driver-side `selectData` list inside it only mutates serialized copies of the list on the executors; the list on the driver stays empty.
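If you really want to keep the manual-query approach, a minimal fix is to use mapPartitions, which returns the rows from each partition instead of mutating driver state. This is only a sketch using the same driver API and the same (assumed) helper names as the question's code (`ScyllaConnector.getCluster`, `tableConfig`, `TEST_QUERY`), and it assumes `dataRdd` is an `RDD[String]` of names:

```scala
import org.apache.spark.sql.Row

// mapPartitions returns an iterator of results per partition, so the
// rows travel back to the driver through the RDD instead of being lost
// in executor-local copies of a driver-side list.
val resultsRdd = dataRdd.mapPartitions { iter =>
  val cluster = ScyllaConnector.getCluster(
    clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
  val session  = cluster.connect(tableConfig.keySpace)
  val prepared = session.prepare(TEST_QUERY)

  val rows = iter.flatMap { testName =>
    // one() may return null when there is no match; wrap it in Option
    Option(session.execute(prepared.bind(testName)).one())
      .map(r => Row(r.getString("user_id"), testName))
  }.toList // materialize before closing the session

  session.close()
  cluster.close()
  rows.iterator
}

// Only collect to the driver if the result is small enough to fit in memory.
val selectData = resultsRdd.collect().toList
```

Note the `.toList` before closing the session: iterators are lazy, and without it the session would be closed before any query actually runs.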

Anyway, this is a very inefficient way of querying data in Cassandra/Scylla from Spark. The Spark Cassandra Connector exists for exactly this, and it should work with Scylla as well thanks to protocol compatibility.
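For the per-key lookup pattern in the question, the connector's idiomatic replacement is joinWithCassandraTable, which performs distributed point lookups against the table. A sketch, assuming the connector is on the classpath and that `name` is the partition key of `test_table` (keyspace and table names below mirror the question):

```scala
import com.datastax.spark.connector._

// Each element becomes one partition-key lookup; Tuple1 marks the
// single-column key. Replaces the manual per-row session.execute calls.
val keysRdd = dataRdd.map(testName => Tuple1(testName))

val joined = keysRdd
  .joinWithCassandraTable(tableConfig.keySpace, "test_table")
  .select("user_id", "name")
```

This keeps the lookups distributed across the executors and lets the connector manage sessions and batching for you.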

To read a DataFrame from Cassandra, just do:

spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "ksname")
  .option("table", "tab")
  .load()

The documentation is quite detailed, so just read it.
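Once loaded, you can apply the question's filter directly on the DataFrame; the connector pushes eligible predicates down to Scylla. A sketch (keyspace/table names are placeholders, as above):

```scala
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "ksname")
  .option("table", "test_table")
  .load()
  // Pushed down to Scylla when the columns are part of the primary key
  .filter(col("name") === "some_name" && col("id_type") === "test_type")
  .select("user_id", "name")
```

No driver-side list is needed at all: `df` is already a DataFrame, which was the stated end goal.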
