Access a specific row from a Spark DataFrame

I am a newbie to Azure Spark / Databricks and am trying to access a specific row, e.g. the 10th row, in a DataFrame.

This is what I have done in the notebook so far:

1. Read a CSV file into a table

spark.read
  .format("csv")
  .option("header", "true")                       // first line of the file holds the column names
  .load("/mnt/training/enb/commonfiles/ramp.csv")
  .write
  .mode("overwrite")
  .saveAsTable("ramp_csv")                        // persist as a managed table named ramp_csv

2. Create a DataFrame for the "table" ramp_csv

val rampDF = spark.read.table("ramp_csv")

3. Read a specific row

I am using the following logic in Scala

val myRow1st = rampDF.rdd.take(10).last

display(myRow1st)

This should display the 10th row, but instead I am getting the following error:

command-2264596624884586:9: error: overloaded method value display with alternatives:
  [A](data: Seq[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])Unit <and>
  (dataset: org.apache.spark.sql.Dataset[_],streamName: String,trigger: org.apache.spark.sql.streaming.Trigger,checkpointLocation: String)Unit <and>
  (model: org.apache.spark.ml.classification.DecisionTreeClassificationModel)Unit <and>
  (model: org.apache.spark.ml.regression.DecisionTreeRegressionModel)Unit <and>
  (model: org.apache.spark.ml.clustering.KMeansModel)Unit <and>
  (model: org.apache.spark.mllib.clustering.KMeansModel)Unit <and>
  (documentable: com.databricks.dbutils_v1.WithHelpMethods)Unit
 cannot be applied to (org.apache.spark.sql.Row)
display(myRow1st)
^
Command took 0.12 seconds --

Could you please share what I am missing here? I tried a few other things, but they didn't work. Thanks in advance for any help!

Here is the breakdown of what is happening in your code:

rampDF.rdd.take(10) returns an Array[Row]

.last returns a Row

display() takes a Dataset, but you are passing it a Row. You can use .show(10) instead to display the first 10 rows in tabular form.

Another option is to do display(rampDF.limit(10)), since limit returns a Dataset rather than a Row.
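
Putting that together, a minimal sketch of the working variants (assuming the rampDF from the question; println is used because display() has no overload for Row):

// Print the first 10 rows in tabular form -- show() works on any Dataset
rampDF.show(10)

// Or keep using Databricks' display(), which accepts a Dataset, not a Row
display(rampDF.limit(10))

// If you really want the 10th Row object, take() on the DataFrame avoids the RDD detour
val myRow10th = rampDF.take(10).last
println(myRow10th)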

I'd go with João's answer as well. But if you insist on getting the Nth row as a DataFrame while avoiding a collect to the driver node (say, when N is very big), you can do:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = (1 to 100).toDF // sample data
val cols = df.columns

df
  .limit(10)                                       // keep only the first 10 rows
  .withColumn("id", monotonically_increasing_id()) // tag each row with an increasing id
  .agg(max(struct(("id" +: cols).map(col(_)): _*)).alias("tenth")) // the max (id, cols) struct is the 10th row
  .select(cols.map(c => col("tenth." + c).alias(c)): _*)           // unpack the struct back into the original columns

This will return:

+-----+
|value|
+-----+
|   10|
+-----+
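
For reuse, the same trick can be wrapped in a small helper. A sketch under the same assumptions (the name nthRow is mine, not part of the answer):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical helper: return the nth row (1-based) of df as a single-row DataFrame,
// using the same limit + monotonically_increasing_id approach as above.
def nthRow(df: DataFrame, n: Int): DataFrame = {
  val cols = df.columns
  df.limit(n)
    .withColumn("id", monotonically_increasing_id())
    .agg(max(struct(("id" +: cols).map(col(_)): _*)).alias("nth"))
    .select(cols.map(c => col("nth." + c).alias(c)): _*)
}

nthRow(df, 10).show() // prints the same single-row table as above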

I also go with João Guitana's answer. An alternative way to get specifically the 10th record:

val df = (1 to 1000).toDF
val tenth = df.limit(10).collect.toList.last
// tenth: org.apache.spark.sql.Row = [10]

That will return the 10th Row of that df.
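
If you need the underlying value rather than the Row wrapper, a minimal follow-up (field index 0 is assumed because the sample df has a single value column):

val value = tenth.getInt(0) // extract the Int from the single-column Row; here value == 10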
