How to create dataframe from dict in another dataframe?

I have a column in a Spark DataFrame.
Output from df.select('parsed').show():

+--------------------+
|              parsed|
+--------------------+
|{Action Flags=I, ...|
|{Action Flags=I, ...|
|{Action Flags=I, ...|
|{Action Flags=I, ...|
+--------------------+

Every element of this column is a dict.
How can I make a new Spark DataFrame from these dicts, using the keys as column names?

Before you can expand a column whose values are dicts into separate columns, you need to know its keys so that the new columns can be named. Below I create a sample DataFrame and then convert the dict keys into columns.

df = sqlContext.createDataFrame([
     [{'a':1,'b':2, 'c': 3}],
     [{'a':1,'b':2, 'c': 3}],
     [{'a':1,'b':2, 'c': 3}]], ["col"]
)
df.show(truncate=False)
+---------------------------+
|col                        |
+---------------------------+
|Map(b -> 2, c -> 3, a -> 1)|
|Map(b -> 2, c -> 3, a -> 1)|
|Map(b -> 2, c -> 3, a -> 1)|
+---------------------------+

After creating the sample DataFrame, get the first row from it:

first_row = df.first()['col']  # select the column that holds dicts as values
print(first_row)
{u'a': 1, u'b': 2, u'c': 3}

The first row gives us one of the dict values. Extract its keys so we can create columns from them:

columns = list(first_row.keys())  # list() keeps this working on Python 3 as well
print(columns)
[u'a', u'c', u'b']
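
Note that the first row's keys cover every column only when all rows share the same keys. As a minimal sketch in plain Python (no Spark involved), assuming the map values have been brought to the driver as ordinary dicts, the helper below gathers the keys from all rows and sorts them so the column order is deterministic. The name `union_of_keys` and the sample data are hypothetical, not part of the original answer.

```python
def union_of_keys(dicts):
    """Return the sorted union of keys found in an iterable of dicts."""
    keys = set()
    for d in dicts:
        keys.update(d.keys())
    return sorted(keys)

# Hypothetical sample values, standing in for dicts taken from the "col" column.
sample = [{'a': 1, 'b': 2}, {'a': 1, 'c': 3}, {'b': 5}]
all_columns = union_of_keys(sample)
print(all_columns)
```

The resulting list could be used in place of the keys taken from the first row when some rows are missing keys.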

Next, loop over the list of keys and select each one as a column from the dict column:

from pyspark.sql import functions as F
col_list = [F.col("col").getItem(col).alias(col) for col in columns]
df.select(col_list).show()
+---+---+---+
|  a|  c|  b|
+---+---+---+
|  1|  3|  2|
|  1|  3|  2|
|  1|  3|  2|
+---+---+---+
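
To make the per-row behaviour concrete, here is a plain-Python emulation of that select; it is not Spark code. It assumes the map values are ordinary dicts and that a missing key yields None, which mirrors how looking up an absent map key typically gives null under Spark's default settings. The helper name `project_rows` and the sample rows are hypothetical.

```python
def project_rows(dicts, columns):
    """Turn each dict into a tuple ordered by `columns`; missing keys become None."""
    return [tuple(d.get(c) for c in columns) for d in dicts]

# Hypothetical sample rows; the second one has no 'b' key.
rows = [{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'c': 6}]
columns = ['a', 'c', 'b']
for row in project_rows(rows, columns):
    print(row)
```

Each tuple follows the order of the column list, so the order in which the keys were extracted decides the order of the output columns.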

There are other ways to do this as well. The above is one way; a second is to add each new column with withColumn:

for cl in columns:  # reuses the columns variable created above
    df = df.withColumn(cl, F.col("col").getItem(cl))
df.show(truncate=False)

+---------------------------+---+---+---+
|col                        |a  |c  |b  |
+---------------------------+---+---+---+
|Map(b -> 2, c -> 3, a -> 1)|1  |3  |2  |
|Map(b -> 2, c -> 3, a -> 1)|1  |3  |2  |
|Map(b -> 2, c -> 3, a -> 1)|1  |3  |2  |
+---------------------------+---+---+---+
