Pyspark: how to create a dataframe with only one row?
What I am trying to do seems to be quite simple. I need to create a dataframe with a single column and a single value.
I have tried a few approaches, namely:
Creation of an empty dataframe and appending the data afterwards:
project_id = 'PC0000000042'
schema = T.StructType([T.StructField("ProjectId", T.StringType(), True)])
empty_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
rdd = sc.parallelize([(project_id)])
df_temp = spark.createDataFrame(rdd, schema)
df = empty_df.union(df_temp)
Creation of a dataframe based on this one value:
rdd = sc.parallelize([(project_id)])
df = spark.createDataFrame(rdd, schema)
However, what I get in both cases is:
TypeError: StructType can not accept object 'PC0000000042' in type <class 'str'>
Which I don't quite understand, since the type seems to be correct. Thank you for any advice!
One small change. If you have project_id = 'PC0000000042', then:
rdd = sc.parallelize([[project_id]])
You should pass the data as a list of lists: [['PC0000000042']] instead of ['PC0000000042'].
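The root cause is visible in plain Python, no Spark needed: wrapping a single value in parentheses does not create a tuple, so `[(project_id)]` is exactly the same as `[project_id]`, a list of bare strings, which `createDataFrame` cannot match against a one-field `StructType`. A minimal sketch of the difference:

```python
project_id = 'PC0000000042'

# Parentheses alone do NOT make a tuple -- (project_id) is still a str,
# so [(project_id)] is just a list containing one string.
assert [(project_id)] == [project_id]

# A trailing comma is what creates a one-element tuple, i.e. a proper "row".
row_as_tuple = (project_id,)
print(type(row_as_tuple))  # <class 'tuple'>

# A list of lists works the same way when passed to createDataFrame.
row_as_list = [[project_id]]
print(row_as_list)  # [['PC0000000042']]
```

Either `[(project_id,)]` or `[[project_id]]` gives Spark a sequence of rows, each with one field, which is what the schema expects.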
If you have 2 rows, then:
project_id = [['PC0000000042'], ['PC0000000043']]
rdd = sc.parallelize(project_id)
spark.createDataFrame(rdd, schema).show()
+------------+
| ProjectId|
+------------+
|PC0000000042|
|PC0000000043|
+------------+
Without RDDs, you can also do:
project_id = [['PC0000000042']]
spark.createDataFrame(project_id, schema=schema).show()