
How to convert or save a CSV file into a TXT file using PySpark

I'm learning PySpark and I don't know how to save the sum of an RDD's values into a file. I tried the code below, without success:

from typing import KeysView

counts = rdd.flatMap(lambda line: line.split(",")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

k=counts.keys().saveAsTextFile("out/out_1_2a.txt")
sc.parallelize(counts.values().sum()).saveAsTextFile('out/out_1_3.txt')

While I could save the keys into a file, I couldn't save the sum of the values. The error I get is: "TypeError: 'int' object is not iterable"

Can someone help?

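sc.parallelize expects an iterable such as a list, but counts.values().sum() returns a single int, which is what raises the TypeError. Compute the sum first, then wrap it in a list before parallelizing: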

counts = rdd.flatMap(lambda line: line.split(",")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

cnt_sum = counts.values().sum()

sc.parallelize([cnt_sum]).coalesce(1).saveAsTextFile("<path>/filename.txt")
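
The error from the question can be reproduced without Spark. The following is a minimal sketch in plain Python, assuming (as the error message suggests) that sc.parallelize needs to iterate over its argument. The value 42 and the variable names error_message and wrapped are illustrative and do not come from the question.

```python
# Stand-in for the int returned by counts.values().sum(); the value is illustrative.
cnt_sum = 42

# Iterating over a bare int fails with the same message as in the question.
try:
    list(cnt_sum)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Wrapping the int in a list gives a one-element collection that can be iterated.
wrapped = list([cnt_sum])

print(error_message)
print(wrapped)
```

Note that saveAsTextFile creates a directory at the given path and writes part files inside it, so "filename.txt" ends up as a directory name rather than a single text file.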

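A shorter alternative (less code). Note that collect() pulls every element to the driver, so this only suits data that fits in driver memory: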

count = len(rdd.flatMap(lambda x: x.split(",")).collect())
sc.parallelize([count]).coalesce(1).saveAsTextFile("<path>/filename.txt")
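
Both approaches should give the same number, because summing the per-word counts yields the total number of comma-separated tokens. Here is a small plain-Python sketch of that equivalence. The sample lines are made up, and collections.Counter stands in for the map/reduceByKey step, so this is a model of the logic rather than Spark code.

```python
from collections import Counter

# Hypothetical sample data standing in for the lines of the RDD.
lines = ["a,b,a", "c,a", "b"]

# Equivalent of flatMap(lambda line: line.split(",")).
tokens = [word for line in lines for word in line.split(",")]

# Equivalent of map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).
counts = Counter(tokens)

sum_of_counts = sum(counts.values())  # first approach: sum of the per-word counts
token_count = len(tokens)             # second approach: number of tokens

print(sum_of_counts, token_count)
```

In Spark itself, rdd.flatMap(lambda x: x.split(",")).count() returns the same number without collecting the elements to the driver.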
