Spark: How to turn a tuple into a DataFrame
I have train_rdd like (('a',1),('b',2),('c',3)). I use the following way to turn it into a DataFrame:
from pyspark.sql import Row
train_label_df = train_rdd.map(lambda x: (Row(**dict(x)))).toDF()
But some keys may be missing in some records, so an error occurs:
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000017/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000017/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000002/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000002/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000002/pyspark.zip/pyspark/rdd.py", line 350, in func
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000002/pyspark.zip/pyspark/rdd.py", line 1859, in combineLocally
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000017/pyspark.zip/pyspark/shuffle.py", line 237, in mergeValues
    for k, v in iterator:
TypeError: cannot unpack non-iterable NoneType object
Is there any other way to convert a tuple-type RDD to a DataFrame?
Update: I also tried using createDataFrame.
from pyspark.sql.types import StructType, StructField, StringType

rdd = sc.parallelize([('a',1), (('a',1), ('b',2)), (('a',1), ('b',2), ('c',3) ) ])
schema = StructType([
StructField("a", StringType(), True),
StructField("b", StringType(), True),
StructField("c", StringType(), True),
])
train_label_df = sqlContext.createDataFrame(rdd, schema)
train_label_df.show()
An error occurs because the rows have fewer elements than the schema has fields:
File "/home/spark/python/pyspark/sql/types.py", line 1400, in verify_struct
"length of fields (%d)" % (len(obj), len(verifiers))))
ValueError: Length of object (2) does not match with length of fields (3)
You can map the tuples into a dict:
rdd1 = rdd.map(lambda x: dict(x if isinstance(x[0],tuple) else [x]))
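To see why the isinstance check is needed, note that the records come in two shapes: a single ('key', value) pair, and a tuple of such pairs. Here is a minimal, Spark-free sketch of that normalization (the helper name to_dict is my own, for illustration):

```python
# Plain-Python illustration of the dict() normalization used in the
# rdd.map lambda above; no Spark required.
def to_dict(x):
    # A tuple of pairs like (('a', 1), ('b', 2)) can be passed to dict()
    # directly; a single pair like ('a', 1) must first be wrapped in a
    # one-element list so dict() sees an iterable of pairs.
    return dict(x if isinstance(x[0], tuple) else [x])

print(to_dict(('a', 1)))              # {'a': 1}
print(to_dict((('a', 1), ('b', 2))))  # {'a': 1, 'b': 2}
```

Without the wrapping, dict(('a', 1)) would raise a ValueError, because dict() would try to unpack the strings 'a' and '1' as key/value pairs.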
and then do one of the following:
from pyspark.sql import Row
cols = ["a", "b", "c"]
rdd1.map(lambda x: Row(**{c:x.get(c) for c in cols})).toDF().show()
+---+----+----+
| a| b| c|
+---+----+----+
| 1|null|null|
| 1| 2|null|
| 1| 2| 3|
+---+----+----+
or
rdd1.map(lambda x: tuple(x.get(c) for c in cols)).toDF(cols).show()
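Both variants rely on dict.get returning None for absent keys, which Spark then renders as null. A small Spark-free sketch of that row-building step (the sample records here are my own, mirroring the expected output above):

```python
# Plain-Python sketch of the tuple(x.get(c) for c in cols) step:
# dict.get(key) returns None when the key is missing, so short records
# are padded with None in the positions of absent columns.
cols = ["a", "b", "c"]
records = [{"a": 1}, {"a": 1, "b": 2}, {"a": 1, "b": 2, "c": 3}]
rows = [tuple(d.get(c) for c in cols) for d in records]
print(rows)  # [(1, None, None), (1, 2, None), (1, 2, 3)]
```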