Spark SQL data from org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])]

I have data of type org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])].
How can I print or extract this data?

I have code like this:

val sessionsDF = Seq(("day1","user1","session1", 100.0),
  ("day1","user1","session2",200.0),
  ("day2","user1","session3",300.0),
  ("day2","user1","session4",400.0),
  ("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal").toDF()

val groupByData=sessionsDF.groupBy(x=>(x.get(0),x.get(1)))
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)

The above code returns org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])].

In your first step, you have an extra .toDF(). The correct version is as follows:

val sessionsDF = Seq(("day1","user1","session1", 100.0),
  ("day1","user1","session2",200.0),
  ("day2","user1","session3",300.0),
  ("day2","user1","session4",400.0),
  ("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal")

In your second step, you missed .rdd, so the actual second step is:

val groupByData=sessionsDF.rdd.groupBy(x=>(x.get(0),x.get(1)))

which has the data type you mentioned in the question:

scala> groupByData
res12: org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])] = ShuffledRDD[9] at groupBy at <console>:25
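
The (Any, Any) in the key type comes from Row.get, which returns Any. If you would rather have typed keys, one option (an assumption on my part; the column types come from the example data, and groupByDataTyped is just a name I made up) is to use the typed getters on Row:

// typed getters give RDD[((String, String), Iterable[Row])] instead of (Any, Any) keys
val groupByDataTyped = sessionsDF.rdd.groupBy(x => (x.getString(0), x.getString(1)))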

To view the groupByData RDD you can simply use foreach:

groupByData.foreach(println)

which would give you:

((day1,user1),CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0]))
((day2,user1),CompactBuffer([day2,user1,session3,300.0], [day2,user1,session4,400.0], [day2,user1,session4,99.0]))
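
One caveat: foreach(println) runs on the executors, so in local mode you see the output in the console, but on a cluster it ends up in the executor logs. If you want the output on the driver, collecting first works for a small dataset like this one (a minimal sketch):

// bring the grouped data to the driver, then print locally
groupByData.collect().foreach(println)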

Now your third step is to filter the data whose day column has the value day1, keeping only the values of the grouped RDD:

val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)

The returned data type for this step is:

scala> filterData
res13: org.apache.spark.rdd.RDD[Iterable[org.apache.spark.sql.Row]] = MapPartitionsRDD[11] at map at <console>:27

You can use foreach as above to view the data:

filterData.foreach(println)

which would give you:

CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0])
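
If you also want to keep the (day, userId) key after filtering, a possible variation (just a sketch; day1Groups is a hypothetical name) is to drop the .map(x=>x._2) step:

// day1Groups: RDD[((Any, Any), Iterable[Row])], restricted to day1
val day1Groups = groupByData.filter(x => x._1._1 == "day1")
day1Groups.foreach(println)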

You can see that the returned type of filterData is RDD[Iterable[org.apache.spark.sql.Row]], so you can print each value using a map:

filterData.map(x => x.map(y => println(y(0), y(1), y(2), y(3)))).collect

which would give you:

(day1,user1,session1,100.0)
(day1,user1,session2,200.0)

If you do only:

filterData.map(x => x.map(y => println(y(0), y(3)))).collect

you would get:

(day1,100.0)
(day1,200.0)
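
Note that the println inside map above also runs on the executors, and the outer map is used only for its side effect. Since the question also asks how to get the data rather than just print it, here is a sketch of an alternative that flattens each Iterable[Row] back to individual rows and brings the values to the driver (day1Rows and day1Totals are hypothetical names; getDouble(3) assumes purchaseTotal is the fourth column, as in the example data):

val day1Rows = filterData.flatMap(identity)   // RDD[Row]

// print the four columns on the driver
day1Rows.map(y => (y(0), y(1), y(2), y(3))).collect().foreach(println)

// or pull the values back as a local collection
val day1Totals: Array[Double] = day1Rows.map(_.getDouble(3)).collect()
// the two day1 purchase totals: 100.0 and 200.0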

I hope the answer is helpful.
