How many ways are there to add a new column to a DataFrame RDD in the Spark API?

Question

I can only think of one, using withColumn():

val newDf = df.withColumn("newcolname", col("existingcol") + 1)  // "existingcol" is a placeholder column name

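But how would I generalize this to text data? For example, suppose my DataFrame has a string value such as "This is an example of a string", and I want to extract the first and last words, as in:

val arrayString: Array[String] = Array(first, last)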

Answer 1

Is this what you are looking for?

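import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, udf}

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)

// Needed for the .toDF conversion on RDDs
import sqlContext.implicits._

// UDFs that return the first and last space-separated word of a sentence
val extractFirstWord = udf((sentence: String) => sentence.split(" ").head)
val extractLastWord = udf((sentence: String) => sentence.split(" ").last)

val sentences = sc.parallelize(Seq("This is an example", "And this is another one", "One_word", "")).toDF("sentence")

// Each withColumn call returns a new DataFrame with one extra column
val splits = sentences
             .withColumn("first_word", extractFirstWord(col("sentence")))
             .withColumn("last_word", extractLastWord(col("sentence")))

splits.show()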

The output is then:

+--------------------+----------+---------+
|            sentence|first_word|last_word|
+--------------------+----------+---------+
|  This is an example|      This|  example|
|And this is anoth...|       And|      one|
|            One_word|  One_word| One_word|
|                    |          |         |
+--------------------+----------+---------+
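The question asked for the first and last words together in a single array column, whereas the answer above produces two separate columns. A minimal sketch of the single-column variant follows, written in Python (PySpark) rather than Scala. The names `first_and_last`, `with_first_last` and the column name `first_last` are hypothetical, chosen only for illustration, and the sketch assumes words are separated by single spaces, as in the answer.

```python
def first_and_last(sentence):
    """Return [first_word, last_word] for a space-separated sentence.

    Splitting on a single space mirrors the Scala answer, so an empty
    string yields ["", ""]. A null (None) value is passed through as None.
    """
    if sentence is None:
        return None
    words = sentence.split(" ")
    return [words[0], words[-1]]


def with_first_last(df, source_col="sentence", target_col="first_last"):
    """Add an array<string> column holding the first and last word.

    `df` is assumed to be a PySpark DataFrame with a string column
    named `source_col`; both column names are illustrative defaults.
    """
    # Imported lazily so first_and_last can be used and tested without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, StringType

    first_last_udf = udf(first_and_last, ArrayType(StringType()))
    return df.withColumn(target_col, first_last_udf(col(source_col)))
```

Calling `with_first_last(sentences)` on a DataFrame like the one in the answer would add one array column instead of two string columns. Keeping the word-splitting logic in a plain function makes it possible to check the edge cases (empty string, single word) without starting a Spark session.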

Answer 2

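# Assumes an existing SparkContext `sc` and SQLContext `sqlContext` (Spark 1.3 API)
from pyspark.sql import Row

# Create a simple DataFrame, stored into a partition directory
df1 = sqlContext.createDataFrame(sc.parallelize(range(1, 6))
                                   .map(lambda i: Row(single=i, double=i * 2)))
df1.save("data/test_table/key=1", "parquet")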

# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
                                   .map(lambda i: Row(single=i, triple=i * 3)))
df2.save("data/test_table/key=2", "parquet")

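# Read the partitioned table; the schemas of both partitions are merged,
# so the result has the columns single, double, triple and the partition column key
df3 = sqlContext.parquetFile("data/test_table")
df3.printSchema()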

https://spark.apache.org/docs/1.3.1/sql-programming-guide.html

Answer 1
2  Accepted  2016-05-17 10:56:28

Answer 2
1  2016-05-17 07:48:21
