Create Spark Row in a map

I saw a Dataframes tutorial at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html which is written in Python. 我在https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html上看到了一个用Python编写的Dataframes教程。 I am trying to translate it into Scala. 我试图将其翻译成Scala。

They have the following code:

df = context.load("/path/to/people.json")
# RDD-style methods such as map, flatMap are available on DataFrames
# Split the bio text into multiple words.
words = df.select("bio").flatMap(lambda row: row.bio.split(" "))
# Create a new DataFrame to count the number of words
words_df = words.map(lambda w: Row(word=w, cnt=1)).toDF()
word_counts = words_df.groupBy("word").sum()

So, I first read the data from a CSV into a DataFrame df, and then I have:

val title_words = df.select("title").flatMap { row =>
  row.getAs[String]("title").split(" ")
}
val title_words_df = title_words.map( w => Row(w,1) ).toDF()
val word_counts = title_words_df.groupBy("word").sum()

but I don't know:

  1. how to assign the field names to the rows in the line beginning with val title_words_df = ...

  2. I am having the error "The value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]"

Thanks in advance for the help.

how to assign the field names to the rows

The Python Row is quite a different type of object from its Scala counterpart. It is a tuple augmented with names, which makes it closer to a product type than to an untyped collection (org.apache.spark.sql.Row).
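A minimal sketch of the difference (assuming a plain Spark shell session): a Scala Row carries no field names of its own, so values are read by position, and any names live in the DataFrame's schema rather than in the row itself.

import org.apache.spark.sql.Row

// A Scala Row is a positional container with no field names of its own.
val row = Row("spark", 1L)
val word = row.getAs[String](0) // access by index, not by attribute name
val cnt = row.getLong(1)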

I am having the error "The value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]"

Since org.apache.spark.sql.Row is essentially untyped, it cannot be used with toDF; it requires createDataFrame with an explicit schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Declare the column names and types explicitly, since Row itself
// carries no type information.
val schema = StructType(Seq(
  StructField("word", StringType), StructField("cnt", LongType)
))

sqlContext.createDataFrame(title_words.map(w => Row(w, 1L)), schema)
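From there the aggregation mirrors the Python version; a sketch, assuming title_words is the RDD[String] built in the question:

// Hypothetical continuation: assign the result and aggregate by word.
val title_words_df = sqlContext.createDataFrame(
  title_words.map(w => Row(w, 1L)), schema)
val word_counts = title_words_df.groupBy("word").sum("cnt")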

If you want your code to be equivalent to the Python version, you should use product types instead of Row. That means either a Tuple:

title_words.map((_, 1L)).toDF("word", "cnt")

or a case class:

case class Record(word: String, cnt: Long)

title_words.map(Record(_, 1L)).toDF
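Both variants rely on the SQL implicits being in scope, which is what enables toDF on RDDs of products. Put together, a sketch of the full word count with the case class (assuming a Spark 1.x sqlContext):

// Required for the toDF conversions on RDDs of products.
import sqlContext.implicits._

val word_counts = title_words
  .map(Record(_, 1L)) // column names come from the case class fields
  .toDF
  .groupBy("word")
  .sum("cnt")
word_counts.show()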

In practice, though, there should be no need to use RDDs at all:

import org.apache.spark.sql.functions.{explode, lit, split}

df.select(explode(split($"title", " ")), lit(1L))
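One caveat with this last snippet: explode produces an auto-generated column name, so it can help to alias the column before aggregating. A sketch (the names below are an assumed choice, not part of the original answer, and the $ interpolator assumes the SQL implicits are imported):

import org.apache.spark.sql.functions.{explode, split}

// Alias the exploded column so the aggregation can refer to it by name.
val word_counts = df
  .select(explode(split($"title", " ")).alias("word"))
  .groupBy("word")
  .count()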
