My Pandas DataFrame
df4.head()
features
0 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
1 [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
Each cell is a python list.
mySchema=StructType([StructField("features",ArrayType(IntegerType()),True)])
sdf2=sqlCtx.createDataFrame(df4,schema=mySchema)
While creating spark Dataframe sdf2, I am getting following error. I tried with different datatypes but in vain.
Error: element in array field features: IntegerType can not accept object 0 in type <class 'numpy.int64'>
I want to run BucketedRandomProjectionLSH in Pysark which accepts a single column with data vector.
That is because you have numpy.int64
objects inside your arrays.
Spark does not accept that.
df = pd.DataFrame([
(np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]),),
(np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]),),
], columns = ['features'])
type(df.iloc[0]['features'][0])
> numpy.int64
df = pd.DataFrame([
([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],),
([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],),
], columns = ['features'])
type(df.iloc[0]['features'][0])
> int
Try using a Python list
instead.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.