Create Spark Row in a map

I saw a Dataframes tutorial at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html which is written in Python. 我在https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html上看到了一个用Python编写的Dataframes教程。 I am trying to translate it into Scala. 我试图将其翻译成Scala。

They have the following code:

df = context.load("/path/to/people.json")
# RDD-style methods such as map, flatMap are available on DataFrames
# Split the bio text into multiple words.
words = df.select("bio").flatMap(lambda row: row.bio.split(" "))
# Create a new DataFrame to count the number of words
words_df = words.map(lambda w: Row(word=w, cnt=1)).toDF()
word_counts = words_df.groupBy("word").sum()

So, I first read the data from a CSV into a DataFrame df, and then I have:

val title_words = df.select("title").flatMap { row =>
  row.getAs[String]("title").split(" ")
}
val title_words_df = title_words.map( w => Row(w,1) ).toDF()
val word_counts = title_words_df.groupBy("word").sum()

but I don't know:

  1. how to assign the field names to the rows in the line beginning with val title_words_df = ...

  2. I am having the error "The value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]"

Thanks in advance for the help.

how to assign the field names to the rows

The Python Row is quite a different type of object from its Scala counterpart. It is a tuple augmented with names, which makes it closer to a product type than to an untyped collection (org.apache.spark.sql.Row).
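A minimal sketch of the difference (assuming a plain Spark shell session): a Scala Row carries no field names of its own, so values are read by position, and any names live in the DataFrame's schema rather than in the row itself.

import org.apache.spark.sql.Row

// A Scala Row is a positional container with no field names of its own.
val row = Row("spark", 1L)
val word = row.getAs[String](0) // access by index, not by attribute name
val cnt = row.getLong(1)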

I am having the error "The value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]"

Since org.apache.spark.sql.Row is essentially untyped, it cannot be used with toDF; it requires createDataFrame with an explicit schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Declare the column names and types explicitly, since Row itself
// carries no type information.
val schema = StructType(Seq(
  StructField("word", StringType), StructField("cnt", LongType)
))

sqlContext.createDataFrame(title_words.map(w => Row(w, 1L)), schema)
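From there the aggregation mirrors the Python version; a sketch, assuming title_words is the RDD[String] built in the question:

// Hypothetical continuation: assign the result and aggregate by word.
val title_words_df = sqlContext.createDataFrame(
  title_words.map(w => Row(w, 1L)), schema)
val word_counts = title_words_df.groupBy("word").sum("cnt")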

If you want your code to be equivalent to the Python version, you should use product types instead of Row. That means either a Tuple:

title_words.map((_, 1L)).toDF("word", "cnt")

or a case class:

case class Record(word: String, cnt: Long)

title_words.map(Record(_, 1L)).toDF
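Both variants rely on the SQL implicits being in scope, which is what enables toDF on RDDs of products. Put together, a sketch of the full word count with the case class (assuming a Spark 1.x sqlContext):

// Required for the toDF conversions on RDDs of products.
import sqlContext.implicits._

val word_counts = title_words
  .map(Record(_, 1L)) // column names come from the case class fields
  .toDF
  .groupBy("word")
  .sum("cnt")
word_counts.show()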

In practice, though, there should be no need to use RDDs at all:

import org.apache.spark.sql.functions.{explode, lit, split}

df.select(explode(split($"title", " ")), lit(1L))
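One caveat with this last snippet: explode produces an auto-generated column name, so it can help to alias the column before aggregating. A sketch (the names below are an assumed choice, not part of the original answer, and the $ interpolator assumes the SQL implicits are imported):

import org.apache.spark.sql.functions.{explode, split}

// Alias the exploded column so the aggregation can refer to it by name.
val word_counts = df
  .select(explode(split($"title", " ")).alias("word"))
  .groupBy("word")
  .count()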
