Unable to create a DataFrame from a list: pyspark

Question

I have a list that is generated by a function. When I print the list:

print(preds_labels)

I get:

[(0.,8.),(0.,13.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,20.),(0.,21.),(0.,23.)]

But when I try to create a DataFrame with this command:

df = sqlContext.createDataFrame(preds_labels, ["prediction", "label"])

I get an error message:

not supported type: type 'numpy.float64'

If I create the list manually, I have no problem. Do you have any idea what is going wrong?

Answer 1

pyspark uses its own type system and unfortunately it doesn't deal with numpy well. It works with Python types though. So you could manually convert each numpy.float64 to float, like this:

df = sqlContext.createDataFrame(
    # Convert each numpy.float64 to a built-in Python float
    [(float(tup[0]), float(tup[1])) for tup in preds_labels],
    ["prediction", "label"]
)

Note that pyspark will then treat them as pyspark.sql.types.DoubleType.
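If the tuples have more than two fields or mix several types, the conversion can be written once and reused. The following is only a sketch: the helper name to_python_rows is hypothetical, and it assumes every numpy value in the list is a scalar, relying on the item() method that numpy scalars provide to return the equivalent built-in Python value.

```python
def to_python_rows(rows):
    """Return rows as tuples of built-in Python values.

    Any value exposing an item() method (as numpy scalars do) is unwrapped;
    every other value is passed through unchanged.
    Assumption: no element is a multi-element numpy array.
    """
    return [
        tuple(v.item() if hasattr(v, "item") else v for v in row)
        for row in rows
    ]
```

With the list from the question, the call would then look like sqlContext.createDataFrame(to_python_rows(preds_labels), ["prediction", "label"]).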

Answer 2

To anyone arriving here with the error:

typeerror not supported type class 'numpy.str_'

This is true for strings as well. So if you created your list of strings using numpy, try changing it to pure Python. See: Create list of single item repeated N times
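As a rough illustration of the pure-Python route, a list holding one string repeated N times can be built with list multiplication, so every element is a built-in str rather than numpy.str_. The function name repeat_label and the sample values are made up for this sketch.

```python
def repeat_label(label, n):
    """Build a list containing `label` n times, without going through numpy."""
    return [label] * n

# Hypothetical sample values, for illustration only
labels = repeat_label("cat", 3)
```

If the strings already live in a numpy array, calling tolist() on the array should likewise give a list of built-in Python values.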

2 answers

Answer 1: 15, accepted, 2016-08-08 10:45:39

Answer 2: 0, 2022-09-16 01:38:37
