Error while converting a pandas DataFrame to a Spark DataFrame

Question

My pandas DataFrame:

df4.head()
                     features
 0          [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
 1          [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...

Each cell is a Python list.

from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType
mySchema = StructType([StructField("features", ArrayType(IntegerType()), True)])
sdf2 = sqlCtx.createDataFrame(df4, schema=mySchema)

While creating the Spark DataFrame sdf2, I get the following error. I tried different data types, but in vain.

Error: element in array field features: IntegerType can not accept object 0 in type <class 'numpy.int64'>

I want to run BucketedRandomProjectionLSH in PySpark, which accepts a single column containing a vector.

Answer 1

That is because your arrays contain numpy.int64 objects. Spark's IntegerType does not accept them; it expects built-in Python int values.

import numpy as np
import pandas as pd

# Cells built from NumPy arrays hold numpy.int64 elements
df = pd.DataFrame([
    (np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]),),
    (np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]),),
], columns=['features'])

type(df.iloc[0]['features'][0])
> numpy.int64

# Cells built from Python list literals hold built-in int elements
df = pd.DataFrame([
    ([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],),
    ([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],),
], columns=['features'])

type(df.iloc[0]['features'][0])
> int

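Try making each cell a plain Python list of built-in int values rather than a NumPy array or a list of numpy.int64 objects. An existing array can be converted with its tolist() method.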
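If the DataFrame already exists, something like the following might work as a rough sketch. It assumes the column is named features as in the question. The helpers to_python_lists and to_spark are hypothetical names made up for this illustration, and spark stands for whatever SparkSession or SQLContext you already have (sqlCtx in the question). I have only checked the pandas part here; the Spark call is what I would expect to work once the cells hold built-in ints.

```python
import numpy as np
import pandas as pd


def to_python_lists(pdf, column):
    """Hypothetical helper: return a copy of pdf whose `column` cells are plain Python lists."""
    out = pdf.copy()
    # np.asarray(cell).tolist() turns NumPy scalars (e.g. numpy.int64) into built-in ints,
    # whether the cell was a NumPy array or a list of NumPy scalars.
    out[column] = out[column].apply(lambda cell: np.asarray(cell).tolist())
    return out


def to_spark(pdf, spark):
    """Hypothetical helper: build the Spark DataFrame with the schema from the question."""
    # Imported lazily so the pandas part runs even where PySpark is not installed.
    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    schema = StructType([StructField("features", ArrayType(IntegerType()), True)])
    return spark.createDataFrame(pdf, schema=schema)


# Small stand-in for df4 from the question (made-up data, NumPy arrays in each cell).
df4 = pd.DataFrame({"features": [np.array([0, 0, 1, 0]), np.array([0, 1, 0, 0])]})

clean = to_python_lists(df4, "features")
first = clean.iloc[0]["features"]
print(type(first), type(first[0]))
```

After that, sdf2 = to_spark(clean, sqlCtx) should no longer hit the IntegerType error, since every element is a built-in int.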
