Error when converting a pandas DataFrame to a Spark DataFrame

Question

My pandas DataFrame:

df4.head()
                     features
 0          [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
 1          [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...

Each cell is a Python list.

from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType

mySchema = StructType([StructField("features", ArrayType(IntegerType()), True)])
sdf2 = sqlCtx.createDataFrame(df4, schema=mySchema)

Creating the Spark DataFrame sdf2 raises the following error. I tried different data types, but to no avail.

Error: element in array field features: IntegerType can not accept object 0 in type <class 'numpy.int64'>

I want to run BucketedRandomProjectionLSH in PySpark, which takes a single column containing data vectors.

Answer 1

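That is because the arrays contain numpy.int64 objects.

Spark's IntegerType does not accept them; it expects built-in Python int values.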

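import numpy as np
import pandas as pd

# Cells built from NumPy arrays: the elements are numpy.int64.
df = pd.DataFrame([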
    (np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]),),
    (np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]),),
], columns = ['features'])

type(df.iloc[0]['features'][0])
> numpy.int64

# Cells built from plain Python lists: the elements are built-in int.
df = pd.DataFrame([
    ([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],),
    ([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],),
], columns = ['features'])

type(df.iloc[0]['features'][0])
> int

Try using plain Python lists (with built-in int elements) instead.
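
If the DataFrame already exists, one way to follow this advice is to convert each cell in place before handing it to Spark. The following is only a minimal sketch: the small df4 built here is a hypothetical stand-in for the asker's data, and to_builtin_ints is a helper name introduced for illustration, not part of pandas or Spark.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df4: each cell holds numpy.int64 values.
df4 = pd.DataFrame(
    [
        (np.array([0, 0, 1, 0, 0, 1]),),
        (np.array([0, 1, 0, 0, 1, 0]),),
    ],
    columns=["features"],
)


def to_builtin_ints(cell):
    """Return the cell as a plain list of built-in Python ints."""
    # np.asarray accepts both ndarrays and lists of NumPy scalars;
    # tolist() converts NumPy scalars to built-in Python scalars.
    return np.asarray(cell, dtype="int64").tolist()


# Replace every cell with a plain Python list.
df4["features"] = df4["features"].apply(to_builtin_ints)

first = df4.iloc[0]["features"]
print(type(first), type(first[0]))
```

After this conversion, the elements are built-in int values, so the ArrayType(IntegerType()) schema from the question should no longer reject them. Note that BucketedRandomProjectionLSH expects a column of ML vectors (for example, built with pyspark.ml.linalg.Vectors.dense) rather than an array column, so a further conversion step would likely be needed for that use case.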
