Access specific row from spark dataframe
I am a newbie to Azure Spark / Databricks and am trying to access a specific row, e.g. the 10th row, in a DataFrame.
This is what I have done in the notebook so far:
1. Read a CSV file into a table
spark.read
.format("csv")
.option("header", "true")
.load("/mnt/training/enb/commonfiles/ramp.csv")
.write
.mode("overwrite")
.saveAsTable("ramp_csv")
2. Create a DataFrame for the "table" ramp_csv
val rampDF = spark.read.table("ramp_csv")
3. Read a specific row
I am using the following logic in Scala:
val myRow1st = rampDF.rdd.take(10).last
display(myRow1st)
and it should display the 10th row, but I am getting the following error:
command-2264596624884586:9: error: overloaded method value display with alternatives:
[A](data: Seq[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])Unit <and>
(dataset: org.apache.spark.sql.Dataset[_],streamName: String,trigger: org.apache.spark.sql.streaming.Trigger,checkpointLocation: String)Unit <and>
(model: org.apache.spark.ml.classification.DecisionTreeClassificationModel)Unit <and>
(model: org.apache.spark.ml.regression.DecisionTreeRegressionModel)Unit <and>
(model: org.apache.spark.ml.clustering.KMeansModel)Unit <and>
(model: org.apache.spark.mllib.clustering.KMeansModel)Unit <and>
(documentable: com.databricks.dbutils_v1.WithHelpMethods)Unit
cannot be applied to (org.apache.spark.sql.Row)
display(myRow1st)
^
Command took 0.12 seconds --
Could you please share what I am missing here? I tried a few other things but they didn't work. Thanks in advance for your help!
Here is a breakdown of what is happening in your code:
- rampDF.rdd.take(10) returns an Array[Row]
- .last returns a Row
- display() takes a Dataset, and you are passing it a Row
You can use .show(10) to display the first 10 rows in tabular form.
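If you do want to pass that single row to display(), a minimal sketch (assuming the rampDF and myRow1st values from the question) is to wrap the Row back into a one-row DataFrame, since display() accepts a Dataset:
// Rebuild a one-row DataFrame from the Row, reusing rampDF's schema so the
// column names and types are preserved (tenthRowDF is a hypothetical name)
val tenthRowDF = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(myRow1st)),
  rampDF.schema
)
display(tenthRowDF)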
Another option is to do display(rampDF.limit(10))
I'd go with João's answer as well. But if you insist on getting the Nth row as a DataFrame and want to avoid collecting it to the driver node (say, when N is very big), you can do:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = (1 to 100).toDF // sample data
val cols = df.columns

df
  .limit(10)                                       // keep only the first 10 rows
  .withColumn("id", monotonically_increasing_id()) // tag rows with an increasing id
  .agg(max(struct(("id" +: cols).map(col(_)): _*)).alias("tenth")) // row with max id = the 10th row
  .select(cols.map(c => col("tenth." + c).alias(c)): _*)           // unpack the struct back into columns
This will return:
+-----+
|value|
+-----+
| 10|
+-----+
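As a follow-up, the same trick generalizes to any N; here is a hedged sketch wrapping it in a helper (nthRow is a hypothetical name, not a Spark API):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical helper: return the nth row of df as a single-row DataFrame,
// using the same limit + monotonically_increasing_id + max(struct) trick
def nthRow(df: DataFrame, n: Int): DataFrame = {
  val cols = df.columns
  df.limit(n)
    .withColumn("id", monotonically_increasing_id())
    .agg(max(struct(("id" +: cols).map(col(_)): _*)).alias("nth"))
    .select(cols.map(c => col("nth." + c).alias(c)): _*)
}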
I also go with João Guitana's answer. An alternative to get specifically the 10th record:
val df = (1 to 1000).toDF
val tenth = df.limit(10).collect.toList.last
tenth: org.apache.spark.sql.Row = [10]
That will return the 10th Row on that df.
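As a usage note, once you have the Row you can extract typed values from it; a minimal sketch (getInt(0) assumes the single column holds integers, as in this sample df):
// The sample df has one integer column ("value"),
// so getInt(0) pulls out the underlying Int
val tenthValue: Int = tenth.getInt(0)
println(tenthValue) // prints 10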