Pyspark: how to create a dataframe with only one row?
What I am trying to do seems to be quite simple. I need to create a dataframe with a single column and a single value.
I have tried a few approaches, namely:
Creation of an empty dataframe and appending the data afterwards:
project_id = 'PC0000000042'
schema = T.StructType([T.StructField("ProjectId", T.StringType(), True)])
empty_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
rdd = sc.parallelize([(project_id)])
df_temp = spark.createDataFrame(rdd, schema)
df = empty_df.union(df_temp)
Creation of a dataframe based on this one value:
rdd = sc.parallelize([(project_id)])
df = spark.createDataFrame(rdd, schema)
However, what I get in both cases is:
TypeError: StructType can not accept object 'PC0000000042' in type <class 'str'>
Which I don't quite understand, since the type seems to be correct. Thank you for any advice!
One small change. If you have project_id = 'PC0000000042', then:
rdd = sc.parallelize([[project_id]])
You should pass the data as a list of lists: [['PC0000000042']] instead of ['PC0000000042'].
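The root cause is visible in plain Python, no Spark needed: wrapping a single value in parentheses does not create a tuple, so `[(project_id)]` is exactly the same as `[project_id]`, a list of bare strings, which `createDataFrame` cannot match against a one-field `StructType`. A minimal sketch of the difference:

```python
project_id = 'PC0000000042'

# Parentheses alone do NOT make a tuple -- (project_id) is still a str,
# so [(project_id)] is just a list containing one string.
assert [(project_id)] == [project_id]

# A trailing comma is what creates a one-element tuple, i.e. a proper "row".
row_as_tuple = (project_id,)
print(type(row_as_tuple))  # <class 'tuple'>

# A list of lists works the same way when passed to createDataFrame.
row_as_list = [[project_id]]
print(row_as_list)  # [['PC0000000042']]
```

Either `[(project_id,)]` or `[[project_id]]` gives Spark a sequence of rows, each with one field, which is what the schema expects.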
If you have 2 rows, then:
project_id = [['PC0000000042'], ['PC0000000043']]
rdd = sc.parallelize(project_id)
spark.createDataFrame(rdd, schema).show()
+------------+
| ProjectId|
+------------+
|PC0000000042|
|PC0000000043|
+------------+
Without RDDs, you can also do:
project_id = [['PC0000000042']]
spark.createDataFrame(project_id, schema=schema).show()