I am new to Azure Spark / Databricks and am trying to access a specific row, e.g. the 10th row, of a DataFrame.
This is what I have done in the notebook so far:
1. Read a CSV file into a table
spark.read
.format("csv")
.option("header", "true")
.load("/mnt/training/enb/commonfiles/ramp.csv")
.write
.mode("overwrite")
.saveAsTable("ramp_csv")
2. Create a DataFrame for the "table" ramp_csv
val rampDF = spark.read.table("ramp_csv")
3. Read a specific row
I am using the following logic in Scala:
val myRow1st = rampDF.rdd.take(10).last
display(myRow1st)
It should display the 10th row, but instead I am getting the following error:
command-2264596624884586:9: error: overloaded method value display with alternatives:
[A](data: Seq[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])Unit <and>
(dataset: org.apache.spark.sql.Dataset[_],streamName: String,trigger: org.apache.spark.sql.streaming.Trigger,checkpointLocation: String)Unit <and>
(model: org.apache.spark.ml.classification.DecisionTreeClassificationModel)Unit <and>
(model: org.apache.spark.ml.regression.DecisionTreeRegressionModel)Unit <and>
(model: org.apache.spark.ml.clustering.KMeansModel)Unit <and>
(model: org.apache.spark.mllib.clustering.KMeansModel)Unit <and>
(documentable: com.databricks.dbutils_v1.WithHelpMethods)Unit
cannot be applied to (org.apache.spark.sql.Row)
display(myRow1st)
^
Could you please tell me what I am missing here? I tried a few other things, but they didn't work. Thanks in advance for your help!
Here is the breakdown of what is happening in your code:
rampDF.rdd.take(10) returns Array[Row]
.last returns Row
display() takes a Dataset, and you are passing it a Row.
You can use .show(10) to display the first 10 rows in tabular form.
Another option is to do display(rampDF.limit(10)).
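Putting those fixes together, a minimal sketch (assuming the rampDF from the question, and a Databricks notebook for display()):

// Option 1: Spark's built-in tabular printer, works anywhere
rampDF.show(10)

// Option 2: display() is fine once it is given a Dataset, not a Row
display(rampDF.limit(10))

// If you really want the 10th Row object: take(10) works directly on the
// DataFrame (no .rdd needed) and returns Array[Row], so pick its last element
val tenthRow = rampDF.take(10).last
println(tenthRow)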
I'd go with João's answer as well. But if you insist on getting the Nth row as a DataFrame, and on avoiding collecting to the driver node (say, when N is very big), you can do:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = (1 to 100).toDF // sample data
val cols = df.columns

df
  .limit(10)                                        // keep only the first 10 rows
  .withColumn("id", monotonically_increasing_id()) // tag each row with an increasing id
  .agg(max(struct(("id" +: cols).map(col(_)): _*)).alias("tenth")) // row with the largest id, i.e. the 10th
  .select(cols.map(c => col("tenth." + c).alias(c)): _*)           // unpack the struct back into columns
This will return:
+-----+
|value|
+-----+
| 10|
+-----+
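A related sketch (not from the original answers) uses a window function: if the DataFrame has a column that defines a stable ordering, row_number lets you select the Nth row without limit or collect. The ordering column "value" and N = 10 here are assumptions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// row_number requires an explicit ORDER BY; "value" is assumed to order the rows.
val byValue = Window.orderBy(col("value"))

val tenthAsDF = df
  .withColumn("rn", row_number().over(byValue)) // 1-based position of each row
  .where(col("rn") === 10)                      // keep only the 10th row
  .drop("rn")

Note that an unpartitioned window moves all rows into a single partition (Spark will warn about this), so this trades the driver-side collect for a shuffle.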
I also go with João Guitana's answer. An alternative to get specifically the 10'th record:
val df = (1 to 1000).toDF
val tenth = df.limit(10).collect.toList.last
tenth: org.apache.spark.sql.Row = [10]
That will return the 10th Row of that df.
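Once collected, the Row's fields can be read by position or by name; "value" below is the default column name Spark gives a Dataset[Int] converted with toDF, so treat it as an assumption:

// tenth is an org.apache.spark.sql.Row
val byPosition = tenth.getInt(0)           // field 0 as Int
val byName     = tenth.getAs[Int]("value") // same field, looked up by column name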