Create a DataFrame with an ArrayType column in PySpark

I am trying to create a new DataFrame with an ArrayType() column. I tried with and without defining a schema, but couldn't get the desired result. My code with the schema is below:

from pyspark.sql.types import *
l = [[1,2,3],[3,2,4],[6,8,9]]
schema = StructType([
  StructField("data", ArrayType(IntegerType()), True)
])
df = spark.createDataFrame(l,schema)
df.show(truncate = False)

This gives the error:

ValueError: Length of object (3) does not match with length of fields (1)

Desired output:

+---------+
|data     |
+---------+
|[1,2,3]  |
|[3,2,4]  |
|[6,8,9]  |
+---------+

Edit:

I found a strange thing (at least for me):

If we use the following code, it gives the expected result:

import pyspark.sql.functions as f
data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido'])
]

df = spark.createDataFrame(data, ["type", "names"])
df.show(truncate=False)

This gives the following expected output:

+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+

But if we remove the first column, it gives an unexpected result.

import pyspark.sql.functions as f
data = [
    (['john', 'sam', 'jane']),
    (['whiskers', 'rover', 'fido'])
]

df = spark.createDataFrame(data, ["names"])
df.show(truncate=False)

This gives the following output:

+--------+-----+----+
|names   |_2   |_3  |
+--------+-----+----+
|john    |sam  |jane|
|whiskers|rover|fido|
+--------+-----+----+

I think you already have the answer to your question. Another solution is:

>>> l = [([1,2,3],), ([3,2,4],),([6,8,9],)]
>>> df = spark.createDataFrame(l, ['data'])
>>> df.show()

+---------+
|     data|
+---------+
|[1, 2, 3]|
|[3, 2, 4]|
|[6, 8, 9]|
+---------+

or

>>> from pyspark.sql.functions import array

>>> l = [[1,2,3],[3,2,4],[6,8,9]]
>>> df = spark.createDataFrame(l)
>>> df = df.withColumn('data',array(df.columns))
>>> df = df.select('data')
>>> df.show()
+---------+
|     data|
+---------+
|[1, 2, 3]|
|[3, 2, 4]|
|[6, 8, 9]|
+---------+
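
The schema from the question can also be kept if each inner list is wrapped in a one-element tuple, so that it counts as one row with one field. This is a minimal sketch, not a definitive implementation: `rows_for_single_column` and `build_array_df` are hypothetical helper names, and `spark` is assumed to be an active SparkSession supplied by the caller.

```python
def rows_for_single_column(values):
    """Wrap each value in a one-element tuple: one row, one field."""
    return [(v,) for v in values]


def build_array_df(spark, values):
    """Build a DataFrame with one ArrayType(IntegerType()) column named 'data'.

    `spark` is assumed to be an existing SparkSession.
    """
    # Imported here so the row-wrapping helper above works without Spark installed.
    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    schema = StructType([
        StructField("data", ArrayType(IntegerType()), True)
    ])
    return spark.createDataFrame(rows_for_single_column(values), schema)


l = [[1, 2, 3], [3, 2, 4], [6, 8, 9]]
rows = rows_for_single_column(l)
print(rows)

# With a running session:
# build_array_df(spark, l).show(truncate=False)
```

Each row now has length 1, which matches the single field in the schema, so the length mismatch reported in the question should no longer apply.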

Regarding the strange thing: it is not that strange, but keep in mind that parentheses around a single value do not create a tuple. The expression is just the value itself; a one-element tuple needs a trailing comma.

>>> (['john', 'sam', 'jane'])
['john', 'sam', 'jane']

>>> type((['john', 'sam', 'jane']))
<class 'list'>

So createDataFrame sees a list, not a tuple, and treats each of its elements as a separate field.
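
The difference can be checked in plain Python, without Spark. The sketch below reuses the names from the question; the variable names `single`, `one_tuple`, and `data` are only illustrative.

```python
# Parentheses alone only group an expression; this is still a list.
single = (['john', 'sam', 'jane'])

# The trailing comma is what makes a one-element tuple.
one_tuple = (['john', 'sam', 'jane'],)

print(type(single).__name__)     # list
print(type(one_tuple).__name__)  # tuple
print(len(one_tuple))            # 1

# Rows written this way each carry exactly one field (the list itself).
data = [
    (['john', 'sam', 'jane'],),
    (['whiskers', 'rover', 'fido'],),
]
print([len(row) for row in data])  # [1, 1]
```

Passing this `data` to `spark.createDataFrame(data, ["names"])` should then give a single array column, in the same way as the first solution above does for the integer lists.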
