Pyspark / Dataframe: Add new column that keeps nested list as nested list

I have a basic question about DataFrames: I want to add a column that should contain a nested list. This is basically the problem:

from pyspark.sql import Row

b = [[['url.de'], ['name']], [['url2.de'], ['name2']]]

a = sc.parallelize(b)
a = a.map(lambda p: Row(URL=p[0], name=p[1]))
df = sqlContext.createDataFrame(a)

list1 = [[['a','s', 'o'],['hallo','ti']],[['a','s', 'o'],['hallo','ti']]]
c = [b[0] + [list1[0]],b[1] + [list1[1]]]

#Output looks like this:
[[['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]], 
 [['url2.de'], ['name2'], [['a', 's', 'o'], ['hallo', 'ti']]]]

To create a new DataFrame from this output, I'm trying to create a new schema:

from pyspark.sql.functions import array, lit

schema = df.withColumn('NewColumn', array(lit("10"))).schema

I then use it to create the new DataFrame:

df = sqlContext.createDataFrame(c,schema)
df.map(lambda x: x).collect()

#Output
[Row(URL=[u'url.de'], name=[u'name'], NewColumn=[u'[a, s, o]', u'[hallo, ti]']),
 Row(URL=[u'url2.de'], name=[u'name2'], NewColumn=[u'[a, s, o]', u'[hallo, ti]'])]

The problem now is that the nested list was transformed into a list with two unicode entries instead of keeping the original format.

I think this is due to my definition of the new column: "... array(lit("10"))".

What do I have to use in order to keep the original format?

You can directly inspect the schema of the DataFrame by calling df.schema. You can see that in the given scenario we have the following:

StructType(
  List(
    StructField(URL,ArrayType(StringType,true),true),
    StructField(name,ArrayType(StringType,true),true),
    StructField(NewColumn,ArrayType(StringType,false),false)
  )
)
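
For a more readable view of the same structure, printSchema() renders it as a tree (output shown as comments, matching the schema above):

df.printSchema()
# root
#  |-- URL: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- name: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- NewColumn: array (nullable = false)
#  |    |-- element: string (containsNull = false)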

The NewColumn that you added is an ArrayType column whose entries are all StringType. So anything that is contained in the array will be converted to a string, even if it is itself an array. If you want to have nested arrays (2 layers), then you need to change your schema so that the NewColumn field has an ArrayType(ArrayType(StringType,False),False) type. You can do this by explicitly defining the schema:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("URL", ArrayType(StringType(),True), True),
    StructField("name", ArrayType(StringType(),True), True),
    StructField("NewColumn", ArrayType(ArrayType(StringType(),False),False), False)])

Or by changing your code so that NewColumn is defined by nesting the array function, array(array()):

df.withColumn('NewColumn',array(array(lit("10")))).schema
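
Either way, the relevant field in the resulting schema should now show the nested element type (based on the nesting of the array calls above):

StructField(NewColumn,ArrayType(ArrayType(StringType,false),false),false)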
