Create Spark Row in a map
I saw a Dataframes tutorial at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html which is written in Python. I am trying to translate it into Scala.
They have the following code:
df = context.load("/path/to/people.json")
# RDD-style methods such as map, flatMap are available on DataFrames
# Split the bio text into multiple words.
words = df.select("bio").flatMap(lambda row: row.bio.split(" "))
# Create a new DataFrame to count the number of words
words_df = words.map(lambda w: Row(word=w, cnt=1)).toDF()
word_counts = words_df.groupBy("word").sum()
So, I first read the data from a csv into a dataframe df and then I have:
val title_words = df.select("title").flatMap { row =>
row.getAs[String]("title").split(" ") }
val title_words_df = title_words.map( w => Row(w,1) ).toDF()
val word_counts = title_words_df.groupBy("word").sum()
but I don't know:

1. how to assign the field names to the rows in the line beginning with val title_words_df = ...
2. I am having the error "The value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]"

Thanks in advance for the help.
how to assign the field names to the rows

Python Row is quite a different type of object than its Scala counterpart. It is a tuple augmented with names, which makes it more equivalent to a product type than to an untyped collection (o.a.s.sql.Row).
I am having the error "The value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]"

Since o.a.s.sql.Row is basically untyped, it cannot be used with toDF and requires createDataFrame with an explicit schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("word", StringType), StructField("cnt", LongType)
))
sqlContext.createDataFrame(title_words.map(w => Row(w, 1L)), schema)
If you want your code to be equivalent to the Python version, you should use product types instead of Row. That means either a Tuple:
title_words.map((_, 1L)).toDF("word", "cnt")
or a case class:
case class Record(word: String, cnt: Long)
title_words.map(Record(_, 1L)).toDF
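The reason both of these work is that toDF comes from an implicit conversion that requires a Product subtype with compile-time type information, which is how Spark derives the column names and types. A minimal plain-Scala sketch (no Spark needed) of what that interface exposes:

```scala
// Sketch: tuples and case classes both implement scala.Product,
// which is what makes them "schema-carrying" for toDF in a way
// that the untyped o.a.s.sql.Row is not.
case class Record(word: String, cnt: Long)

val asCaseClass = Record("spark", 1L)
val asTuple = ("spark", 1L)

// Both expose their arity and fields through the Product interface
println(asCaseClass.isInstanceOf[Product]) // true
println(asTuple.isInstanceOf[Product])     // true
println(asCaseClass.productArity)          // 2
println(asCaseClass.productElement(0))     // spark
```

For a case class, the field names (word, cnt) become the column names automatically; for a tuple you pass them to toDF explicitly.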
In practice though, there should be no need to use RDDs at all:
import org.apache.spark.sql.functions.{explode, lit, split}
df.select(explode(split($"title", " ")), lit(1L))
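For intuition, the split/explode pipeline does the same thing as an ordinary flatMap-then-group word count. A plain-Scala sketch over hypothetical sample titles (no Spark required):

```scala
// Hypothetical sample data standing in for the "title" column
val titles = Seq("spark sql intro", "spark rdd basics")

val wordCounts = titles
  .flatMap(_.split(" "))                       // like explode(split($"title", " "))
  .groupBy(identity)                           // like groupBy on the word column
  .map { case (w, ws) => (w, ws.size.toLong) } // like summing the lit(1L) column

println(wordCounts("spark")) // 2
```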