PySpark - Create DataFrame from Numpy Matrix
I have a numpy matrix:
arr = np.array([[2,3], [2,8], [2,3],[4,5]])
I need to create a PySpark DataFrame from arr. I cannot enter the values manually because the length and values of arr change dynamically, so I need to convert arr into a DataFrame.
I tried the following code without success.
df= sqlContext.createDataFrame(arr,["A", "B"])
However, I get the following error.
TypeError: Can not infer schema for type: <type 'numpy.ndarray'>
Hope this helps!
import numpy as np
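# sample data
arr = np.array([[2,3], [2,8], [2,3],[4,5]])
# distribute the rows; assumes an existing SparkContext named `sc`
rdd1 = sc.parallelize(arr)
# convert numpy integers to plain Python ints so Spark can infer the schema
rdd2 = rdd1.map(lambda x: [int(i) for i in x])
df = rdd2.toDF(["A", "B"])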
df.show()
Output is:
+---+---+
|  A|  B|
+---+---+
|  2|  3|
|  2|  8|
|  2|  3|
|  4|  5|
+---+---+
No need to use the RDD API. Simply:
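import numpy as np
# assumes an existing SparkSession named `spark`
mat = np.random.random((10,3))
cols = ["ColA","ColB","ColC"]
# tolist() turns the matrix into nested lists of plain Python floats
df = spark.createDataFrame(mat.tolist(), cols)
df.show()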
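For what it's worth, here is a minimal sketch of why the tolist() route avoids the schema error: it hands Spark native Python values instead of numpy ones. The helper names numpy_to_rows and numpy_to_df are hypothetical (not part of PySpark), and numpy_to_df assumes you pass in an existing SparkSession.

```python
import numpy as np


def numpy_to_rows(arr):
    """Convert a 2-D numpy array into a list of tuples of native Python values."""
    arr = np.asarray(arr)
    if arr.ndim != 2:
        raise ValueError("expected a 2-D array, got %d dimension(s)" % arr.ndim)
    # tolist() converts numpy scalars (e.g. numpy.int64) to Python int/float
    return [tuple(row) for row in arr.tolist()]


def numpy_to_df(spark, arr, cols):
    """Hypothetical helper: `spark` is assumed to be an existing SparkSession."""
    rows = numpy_to_rows(arr)
    if rows and len(rows[0]) != len(cols):
        raise ValueError("number of column names must match number of array columns")
    return spark.createDataFrame(rows, cols)


# The array from the question; its length can change without touching the code
arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])
rows = numpy_to_rows(arr)
print(rows)
print(type(rows[0][0]))
```

With a live session, numpy_to_df(spark, arr, ["A", "B"]).show() should give the same table as the RDD version above.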
import numpy as np
from pyspark.ml.linalg import Vectors
arr = np.array([[2,3], [2,8], [2,3],[4,5]])
# first column becomes the label, the remaining columns the feature vector
rows = [(int(x[0]), Vectors.dense(x[1:])) for x in arr]
mydf = spark.createDataFrame(rows, schema=["label", "features"])
mydf.show(5)
Try this; it should work.