Can I write RDD data to an Excel file from within a map() in Apache Spark? Is that a correct approach? Won't the writing be a local operation that cannot be shipped across the cluster?
Below is the Python code (it is just an example to clarify my question; I understand this implementation may not actually be required):
import xlsxwriter
from pyspark import SparkContext

# assume the SparkContext is already available as sc

# workbook and worksheet are created in the driver process
workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()

# xyz.txt is a file in which each line contains strings delimited by <SPACE>
data = sc.textFile("xyz.txt")

row = 0

def mapperFunc(x):
    global row
    for i in range(0, 4):
        worksheet.write(row, i, x.split(" ")[i])
    row += 1
    return len(x.split())

data2 = data.map(mapperFunc)
workbook.close()
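For comparison, the usual pattern is to keep the map() side effect-free and do all the writing on the driver after collect(). Below is a minimal sketch of that idea; the Spark and xlsxwriter calls are shown as comments (they assume a live SparkContext and the xlsxwriter package), and a local stand-in list is used so the row/column assignment logic itself is visible:

```python
def split_fields(line, n=4):
    """Pure transformation: no shared state, so it is safe to serialize
    and run on executors."""
    return line.split(" ")[:n]

# On a cluster this would be:
#   rows = sc.textFile("xyz.txt").map(split_fields).collect()
# Local stand-in data for illustration:
rows = [split_fields(l) for l in ["a b c d e", "w x y z"]]

# Driver-side only: a single process assigns row numbers deterministically,
# so there is no shared 'row' counter to race on.
cells = [(r, c, v)
         for r, fields in enumerate(rows)
         for c, v in enumerate(fields)]

# Each (row, col, value) triple would then be written with:
#   worksheet.write(row, col, value)
```

The key point is that the distributed part (parsing) stays pure, and the single-machine part (Excel output) happens only after the data has been brought back to the driver.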
There are 2 questions:
1. Can I write the RDD data to an Excel file through a mapping like this?
2. Is that a correct way, given that the map function runs on the cluster?
Also, if #2 is correct, then please clarify my doubt: I think the worksheet is created on the local machine, so how does it work?
Thanks
The hadoopoffice library enables you to write Excel files from Spark 1.x via its ExcelOutputFormat (using PairRdd.saveAsNewAPIHadoopFile) or from Spark 2.x via the data source API. With this library you can store the files on HDFS, locally, on S3, on Azure, etc.
Find some examples here: https://github.com/zuinnote/hadoopoffice
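To give a feel for the Spark 2.x data source route, here is a hypothetical sketch. The format name and option shown follow the project's README; treat them (and the package coordinates) as assumptions to verify against the linked repository:

```python
from pyspark.sql import SparkSession

# Launch with the hadoopoffice data source on the classpath, e.g. via
# spark-submit --packages <spark-hadoopoffice-ds coordinates from the README>
spark = SparkSession.builder.appName("excel-out").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2)], ["name", "n"])

# Write the DataFrame as an Excel file; the format string and option name
# are taken from the hadoopoffice documentation (an assumption here).
(df.write
   .format("org.zuinnote.spark.office.excel")
   .option("write.locale.bcp47", "en")
   .save("output_excel"))
```

Because the output format is a proper Hadoop OutputFormat, the write is executed by the cluster and the result lands in whatever filesystem the path points to (HDFS, local, S3, ...), which avoids the driver-local-file problem from the question.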