How do I create a dataframe from a list in Spark SQL?

Question

Spark version: 2.1

For example, in pyspark, I create a list:

test_list = [['Hello', 'world'], ['I', 'am', 'fine']]

How can I then create a dataframe from test_list, where the dataframe has the following type:

DataFrame[words: array<string>]

Answer 1

Here is how to do it:

from pyspark.sql.types import *

cSchema = StructType([StructField("WordList", ArrayType(StringType()))])

# Note the extra square brackets around each element: each row holds one array-valued column
test_list = [[['Hello', 'world']], [['I', 'am', 'fine']]]

df = spark.createDataFrame(test_list,schema=cSchema)
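
The extra brackets matter because createDataFrame treats each element of the input as one row, and each item inside a row as one column value. A minimal sketch of that wrapping step, assuming you start from the flat test_list in the question; wrap_rows and to_dataframe are hypothetical helper names, `spark` is assumed to be an existing SparkSession, and the Spark call is kept inside a function so the wrapping itself can be checked without a running session:

```python
def wrap_rows(list_of_lists):
    """Wrap each inner list in a one-element tuple so it becomes a single array-valued column."""
    return [(words,) for words in list_of_lists]


def to_dataframe(spark, list_of_lists, column_name="words"):
    """Build a dataframe with one array<string> column from a list of string lists.

    `spark` is assumed to be an existing SparkSession (hypothetical parameter).
    """
    # Imported lazily so the wrapping helper above works without pyspark installed.
    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    schema = StructType([StructField(column_name, ArrayType(StringType()))])
    return spark.createDataFrame(wrap_rows(list_of_lists), schema=schema)


test_list = [['Hello', 'world'], ['I', 'am', 'fine']]
rows = wrap_rows(test_list)
print(rows)  # [(['Hello', 'world'],), (['I', 'am', 'fine'],)]
```

With a session available, to_dataframe(spark, test_list) should give the DataFrame[words: array<string>] asked for in the question.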

Answer 2

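I had to work with multiple columns and types; the example below has one string column and one integer column. A slight adjustment to Pushkr's code (above) gives: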

from pyspark.sql.types import *

cSchema = StructType([StructField("Words", StringType()),
                      StructField("total", IntegerType())])

test_list = [['Hello', 1], ['I am fine', 3]]

df = spark.createDataFrame(test_list,schema=cSchema)

Output:

df.show()
+---------+-----+
|    Words|total|
+---------+-----+
|    Hello|    1|
|I am fine|    3|
+---------+-----+

Answer 3

You should use a list of Row objects ([Row]) to create the dataframe.

from pyspark.sql import Row

spark.createDataFrame(list(map(lambda x: Row(words=x), test_list)))
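
With Row, the column name comes from the keyword argument and the column type is inferred from the data, so no explicit schema is needed. A rough sketch of the same approach with the name made configurable; to_row_kwargs and rows_to_dataframe are hypothetical helpers, and `spark` is assumed to be an existing SparkSession:

```python
def to_row_kwargs(list_of_lists, column_name="words"):
    """Turn each inner list into the keyword arguments for one Row."""
    return [{column_name: words} for words in list_of_lists]


def rows_to_dataframe(spark, list_of_lists, column_name="words"):
    """Create a dataframe from Row objects; the schema is inferred by Spark.

    `spark` is assumed to be an existing SparkSession (hypothetical parameter).
    """
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import Row

    rows = [Row(**kwargs) for kwargs in to_row_kwargs(list_of_lists, column_name)]
    return spark.createDataFrame(rows)


test_list = [['Hello', 'world'], ['I', 'am', 'fine']]
row_kwargs = to_row_kwargs(test_list)
print(row_kwargs)  # [{'words': ['Hello', 'world']}, {'words': ['I', 'am', 'fine']}]
```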

Answer 4

You can first create an RDD from the input and then convert that RDD to a dataframe:
<code>
import sqlContext.implicits._
val testList = Array(Array("Hello", "world"), Array("I", "am", "fine"))
// Create the RDD
val testListRDD = sc.parallelize(testList)
// Note: flatMap flattens the nested arrays, so each word becomes its own row
val flatTestListRDD = testListRDD.flatMap(entry => entry)
// Convert the RDD to a DataFrame (a single string column, not array<string>)
val testListDF = flatTestListRDD.toDF
testListDF.show
</code>

4 answers

Answer 1: score 22, accepted, 2017-04-17 04:27:01

Answer 2: score 9, 2018-06-21 13:38:05

Answer 3: score 3, 2018-06-21 19:19:10

Answer 4: score -3, 2017-04-17 04:16:25
