Access specific row from spark dataframe
I am a newbie to Azure Spark / Databricks and am trying to access a specific row, e.g. the 10th row, in a DataFrame.
This is what I have done in the notebook so far:
1. Read a CSV file into a table
spark.read
.format("csv")
.option("header", "true")
.load("/mnt/training/enb/commonfiles/ramp.csv")
.write
.mode("overwrite")
.saveAsTable("ramp_csv")
2. Create a DataFrame for the "table" ramp_csv
val rampDF = spark.read.table("ramp_csv")
3. Read a specific row
I am using the following logic in Scala:
val myRow1st = rampDF.rdd.take(10).last
display(myRow1st)
and it should display the 10th row, but I am getting the following error:
command-2264596624884586:9: error: overloaded method value display with alternatives:
[A](data: Seq[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])Unit <and>
(dataset: org.apache.spark.sql.Dataset[_],streamName: String,trigger: org.apache.spark.sql.streaming.Trigger,checkpointLocation: String)Unit <and>
(model: org.apache.spark.ml.classification.DecisionTreeClassificationModel)Unit <and>
(model: org.apache.spark.ml.regression.DecisionTreeRegressionModel)Unit <and>
(model: org.apache.spark.ml.clustering.KMeansModel)Unit <and>
(model: org.apache.spark.mllib.clustering.KMeansModel)Unit <and>
(documentable: com.databricks.dbutils_v1.WithHelpMethods)Unit
cannot be applied to (org.apache.spark.sql.Row)
display(myRow1st)
^
Command took 0.12 seconds --
Could you please share what I am missing here? I tried a few other things but they didn't work. Thanks in advance for your help!
Here is a breakdown of what is happening in your code:
- rampDF.rdd.take(10) returns an Array[Row]
- .last returns a Row
- display() takes a Dataset, and you are passing it a Row
You can use .show(10) to display the first 10 rows in tabular form.
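If you do want to pass that single row to display(), a minimal sketch (assuming the rampDF and myRow1st values from the question) is to wrap the Row back into a one-row DataFrame, since display() accepts a Dataset:
// Rebuild a one-row DataFrame from the Row, reusing rampDF's schema so the
// column names and types are preserved (tenthRowDF is a hypothetical name)
val tenthRowDF = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(myRow1st)),
  rampDF.schema
)
display(tenthRowDF)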
Another option is to do display(rampDF.limit(10))
I'd go with João's answer as well. But if you insist on getting the Nth row as a DataFrame and want to avoid collecting it to the driver node (say, when N is very big), you can do:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = (1 to 100).toDF // sample data
val cols = df.columns

df
  .limit(10)                                       // keep only the first 10 rows
  .withColumn("id", monotonically_increasing_id()) // tag rows with an increasing id
  .agg(max(struct(("id" +: cols).map(col(_)): _*)).alias("tenth")) // row with max id = the 10th row
  .select(cols.map(c => col("tenth." + c).alias(c)): _*)           // unpack the struct back into columns
This will return:
+-----+
|value|
+-----+
| 10|
+-----+
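As a follow-up, the same trick generalizes to any N; here is a hedged sketch wrapping it in a helper (nthRow is a hypothetical name, not a Spark API):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical helper: return the nth row of df as a single-row DataFrame,
// using the same limit + monotonically_increasing_id + max(struct) trick
def nthRow(df: DataFrame, n: Int): DataFrame = {
  val cols = df.columns
  df.limit(n)
    .withColumn("id", monotonically_increasing_id())
    .agg(max(struct(("id" +: cols).map(col(_)): _*)).alias("nth"))
    .select(cols.map(c => col("nth." + c).alias(c)): _*)
}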
I also go with João Guitana's answer. An alternative to get specifically the 10th record:
val df = (1 to 1000).toDF
val tenth = df.limit(10).collect.toList.last
tenth: org.apache.spark.sql.Row = [10]
That will return the 10th Row on that df.
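As a usage note, once you have the Row you can extract typed values from it; a minimal sketch (getInt(0) assumes the single column holds integers, as in this sample df):
// The sample df has one integer column ("value"),
// so getInt(0) pulls out the underlying Int
val tenthValue: Int = tenth.getInt(0)
println(tenthValue) // prints 10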