I'm using PySpark and have loaded a Hive table into a DataFrame:
df = sqlContext.sql("from hive_table select *")
I need help converting this df to a NumPy array. You may assume hive_table has only one column.
Any suggestions? Thanks in advance.
You can:
sqlContext.range(0, 10).toPandas().values # .reshape(-1) for 1d array
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
but it is unlikely you really want to. The resulting array will be local to the driver node, so it is rarely useful. If you're looking for some variant of a distributed array-like data structure, there are a number of possible choices in Apache Spark:
pyspark.mllib.linalg.distributed, which provides a number of distributed matrix classes
sparkit-learn's ArrayRDD
as well as other distributed array libraries independent of Apache Spark.
A simpler approach, if the data fits in driver memory, is to collect the rows and build the array directly:

import numpy as np

# select the columns of interest, pull the rows to the driver with
# collect(), and convert the resulting list of Row objects to an ndarray
data_array = np.array(df.select("column1", "column2", "column3").collect())
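For the single-column case from the question, here is a minimal sketch of getting a flat 1-D array. Since a Spark session isn't available here, the runnable part simulates what df.collect() returns: collect() on a single-column DataFrame yields a list of Row objects, which unpack like one-element tuples (the values below are hypothetical stand-ins for hive_table's contents):

```python
import numpy as np

# Simulated output of df.collect() for a one-column DataFrame;
# each Row behaves like a one-element tuple.
rows = [(0,), (1,), (2,), (3,), (4,)]

arr = np.array(rows)    # shape (5, 1), same layout as toPandas().values
flat = arr.reshape(-1)  # flatten to a 1-D array of shape (5,)

print(flat.shape)  # (5,)
```

With the real DataFrame this would read np.array(df.collect()).reshape(-1); keep in mind the entire column must fit in driver memory.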