How to convert a PySpark RDD into a DataFrame
I have a DataFrame df like below:
df =
+---+---+----+---+---+
| a| b| c| d| e|
+---+---+----+---+---+
| 1| a|foo1| 4| 5|
| 2| b| bar| 4| 6|
| 3| c| mnc| 4| 7|
| 4| c| mnc| 4| 7|
+---+---+----+---+---+
I want to achieve something like df1 =
+---+---+-----------------------------------------------+
| a| b| c |
+---+---+-----------------------------------------------+
| 1| a|{'a': 1, 'b': 'a', 'c': 'foo1', 'd': 4, 'e': 5}|
| 2| b|{'a': 2, 'b': 'b', 'c': 'bar', 'd': 4, 'e': 6} |
| 3| c|{'a': 3, 'b': 'c', 'c': 'mnc', 'd': 4, 'e': 7} |
| 4| c|{'a': 4, 'b': 'c', 'c': 'mnc', 'd': 4, 'e': 7} |
+---+---+-----------------------------------------------+
I really wanted to avoid a group by, so I thought I would first convert the DataFrame to an RDD and then convert it back into a DataFrame. The piece of code I wrote was:
df2=df.rdd.flatMap(lambda x:(x.a,x.b,x.asDict()))
While doing a foreach on df2 I get the result in RDD format, so I tried to create a DataFrame out of it:
df3=df2.toDF() #1st way
df3=sparkSession.createDataframe(df2) #2nd way
But I am getting an error for both ways. Can someone explain what I am doing wrong here and how to achieve my requirement?
This can be done with Spark SQL as below:
Spark SQL
data.createOrReplaceTempView("data")
spark.sql("""
select a, b, to_json(named_struct('a',a, 'b',b,'c',c,'d',d,'e',e)) as c
from data""").show(20,False)
Output
# +---+---+----------------------------------------+
# |a |b |c |
# +---+---+----------------------------------------+
# |1 |a |{"a":1,"b":"a","c":"foo1","d":"4","e":5}|
# |2 |b |{"a":2,"b":"b","c":"bar","d":"4","e":6} |
# |3 |c |{"a":3,"b":"c","c":"mnc","d":"4","e":7} |
# |4 |c |{"a":4,"b":"c","c":"mnc","d":"4","e":7} |
# +---+---+----------------------------------------+
DataFrame API
result = data\
.withColumn('c',to_json(struct(data.a,data.b,data.c,data.d,data.e)))\
.select("a","b","c")
result.show(20,False)
Output
# +---+---+----------------------------------------+
# |a |b |c |
# +---+---+----------------------------------------+
# |1 |a |{"a":1,"b":"a","c":"foo1","d":"4","e":5}|
# |2 |b |{"a":2,"b":"b","c":"bar","d":"4","e":6} |
# |3 |c |{"a":3,"b":"c","c":"mnc","d":"4","e":7} |
# |4 |c |{"a":4,"b":"c","c":"mnc","d":"4","e":7} |
# +---+---+----------------------------------------+
You can create a JSON column from a map type column:
import pyspark.sql.functions as F

df = sqlContext.createDataFrame(
    [(0, 1, 23, 4, 8, 9, 5, "b1"), (1, 2, 43, 8, 10, 20, 43, "e1")],
    ("id", "a1", "b1", "c1", "d1", "e1", "f1", "ref")
)
# pair each column name (as a literal) with its value column,
# then flatten into [name1, col1, name2, col2, ...] for create_map
tst = [[F.lit(c), F.col(c)] for c in df.columns]
tst_flat = [item for sublist in tst for item in sublist]

map_coln = F.create_map(*tst_flat)
df1 = df.withColumn("out", F.to_json(map_coln))
Result:
In [37]: df1.show(truncate=False)
+---+---+---+---+---+---+---+---+-------------------------------------------------------------------------------+
|id |a1 |b1 |c1 |d1 |e1 |f1 |ref|out |
+---+---+---+---+---+---+---+---+-------------------------------------------------------------------------------+
|0 |1 |23 |4 |8 |9 |5 |b1 |{"id":"0","a1":"1","b1":"23","c1":"4","d1":"8","e1":"9","f1":"5","ref":"b1"} |
|1 |2 |43 |8 |10 |20 |43 |e1 |{"id":"1","a1":"2","b1":"43","c1":"8","d1":"10","e1":"20","f1":"43","ref":"e1"}|
+---+---+---+---+---+---+---+---+-------------------------------------------------------------------------------+