How do I convert a numpy array to a pyspark dataframe?
I want to convert my results1 numpy array to a dataframe. For the record, results1 looks like
array([(1.0, 0.1738578587770462), (1.0, 0.33307021689414978),
(1.0, 0.21377330869436264), (1.0, 0.443511435389518738),
(1.0, 0.3278091162443161), (1.0, 0.041347454154491425)]).
I want to convert the above to a pyspark DataFrame with columns labeled "limit" (the first value in the tuple) and "probability" (the second value in the tuple).
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('YKP').getOrCreate()
sc = spark.sparkContext
# Convert list to RDD
rdd = sc.parallelize(results1)
# Create data frame
df = sc.createDataFrame(rdd)
I keep getting the error
AttributeError: 'RemoteContext' object has no attribute 'createDataFrame'
when I run this. I don't see why this gives me an error, or how to fix it.
Use map() and toDF() instead.
import numpy as np
results1 = np.array([(1.0, 0.1738578587770462), (1.0, 0.33307021689414978),
(1.0, 0.21377330869436264), (1.0, 0.443511435389518738),
(1.0, 0.3278091162443161), (1.0, 0.041347454154491425)])
df = sc.parallelize(results1).map(lambda x: [float(i) for i in x])\
.toDF(["limit", "probability"])
df.show()
+-----+--------------------+
|limit| probability|
+-----+--------------------+
| 1.0| 0.1738578587770462|
| 1.0| 0.3330702168941498|
| 1.0| 0.21377330869436265|
| 1.0| 0.44351143538951876|
| 1.0| 0.3278091162443161|
| 1.0|0.041347454154491425|
+-----+--------------------+
The simplest way is:
df = rdd.map(lambda x: (x, )).toDF()
df.show()
You can also refer to this post for more details: Create Spark DataFrame. Can not infer schema for type: <type 'float'>