Issue writing data from 2 RDDs (one with unicode data and one with normal) into a csv file in PySpark
I have two RDDs:
RDD1: data in RDD1 is in unicode format
[[u'a',u'b',u'c'],[u'c',u'f',u'a'],[u'ab',u'cd',u'gh']...]
RDD2:
[(10.1, 10.0), (23.0, 34.0), (45.0, 23.0),....]
Both RDDs have the same number of rows (but one has 2 columns/elements in each row/record and the other has 3). Now what I want to do is take all elements of each row from RDD1 plus the 2nd element of the corresponding row from RDD2, and write them out to a csv file on the local file system (not HDFS). So the output in the csv file for the above sample would be:
a,b,c,10.0
c,f,a,34.0
ab,cd,gh,23.0
How can I do that in PySpark?
UPDATE: This is my current code:
columns_num = [0, 1, 2, 4, 7]
rdd1 = rdd3.map(lambda row: [row[i] for i in columns_num])
rdd2 = rd.map(lambda tup: (tup[0], tup[1] + (tup[0] / 3)) if tup[0] - tup[1] >= tup[0] / 3 else (tup[0], tup[1]))

with open("output.csv", "w") as fw:
    writer = csv.writer(fw)
    for (r1, r2) in izip(rdd1.toLocalIterator(), rdd2.toLocalIterator()):
        writer.writerow(r1 + tuple(r2[1:2]))
I am getting the error `TypeError: can only concatenate list (not "tuple") to list`. If I do `writer.writerow(tuple(r1) + r2[1:2])` then I get `UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 16: ordinal not in range(128)`.
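For context, the TypeError comes from mixing sequence types: Python refuses to concatenate a list with a tuple, so both operands must be converted to the same type first. A minimal illustration of that, using hypothetical sample rows shaped like the data above:

```python
# A row mapped to a list comprehension (as in rdd1 above) is a list;
# a row from the tuple-based RDD is a tuple.
r1 = [u"a", u"b", u"c"]
r2 = (10.1, 10.0)

# r1 + r2[1:2] would raise:
#   TypeError: can only concatenate list (not "tuple") to list
combined = list(r1) + list(r2[1:2])  # convert both sides to lists first
print(combined)
```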
If by local you mean the driver file system, then you can simply `collect` or convert with `toLocalIterator` and write:
import csv
import sys

if sys.version_info.major == 2:
    from itertools import izip
else:
    izip = zip

rdd1 = sc.parallelize([(10.1, 10.0), (23.0, 34.0), (45.0, 23.0)])
rdd2 = sc.parallelize([("a", "b", "c"), ("c", "f", "a"), ("ab", "cd", "gh")])

with open("output.csv", "w") as fw:
    writer = csv.writer(fw)
    for (r1, r2) in izip(rdd2.toLocalIterator(), rdd1.toLocalIterator()):
        writer.writerow(r1 + r2[1:2])
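As for the `UnicodeEncodeError`: on Python 2 the `csv` module writes byte strings, so it tries to ASCII-encode any unicode field and fails on non-ASCII characters. A common workaround is to encode each unicode field to UTF-8 before writing. A minimal sketch of that idea, using plain lists in place of the RDD iterators:

```python
import csv
import sys

# Hypothetical stand-ins for rdd2.toLocalIterator() and rdd1.toLocalIterator()
rows_str = [(u"a", u"b", u"c"), (u"c", u"f", u"a")]
rows_num = [(10.1, 10.0), (23.0, 34.0)]

def encode_row(row):
    # Python 2's csv module expects byte strings, so encode unicode fields
    # to UTF-8 there; Python 3's csv module handles str (unicode) natively.
    if sys.version_info.major == 2:
        return [f.encode("utf-8") if isinstance(f, unicode) else f for f in row]
    return list(row)

with open("output.csv", "w") as fw:
    writer = csv.writer(fw)
    for r1, r2 in zip(rows_str, rows_num):
        writer.writerow(encode_row(tuple(r1) + r2[1:2]))
```

On Python 3 `encode_row` is a no-op apart from the list conversion, so the same loop works in both interpreter versions.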