
Writing RDD data to an Excel file during mapping in apache-spark

Can I write RDD data to an Excel file during mapping in apache-spark? Is that a correct way to do it? Wouldn't the writing be a local function that can't be shipped across the cluster?

Below is the Python code (it is just an example to clarify my question; I understand that this implementation may not actually be required):

import xlsxwriter
from pyspark import SparkContext

# assume the SparkContext is already available in sc

workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()

data = sc.textFile("xyz.txt")
# xyz.txt is a file in which each line contains strings delimited by <SPACE>

row = 0

def mapperFunc(x):
    global row                                    # update the module-level counter
    for i in range(0, 4):
        worksheet.write(row, i, x.split(" ")[i])  # write the first 4 fields of the line
    row += 1                                      # move to the next worksheet row
    return len(x.split())

data2 = data.map(mapperFunc)
data2.count()  # map() is lazy; an action is needed before mapperFunc runs

workbook.close()

There are 2 questions:

  1. Is using row in mapperFunc like this correct? Will it increment row each time?
  2. Is writing to the Excel file using worksheet.write() inside the mapper function correct?

Also, if #2 is correct, please clarify this doubt: the worksheet is created on the local machine, so how does that work across the cluster?

Thanks

The hadoopoffice library enables you to write Excel files with Spark 1.x via its ExcelOutputFormat (using PairRdd.saveAsNewAPIHadoopFile) or with the Spark 2.x data source API. Using this library you can store the files on HDFS, locally, on S3, on Azure, and so on.
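For the Spark 2.x route, here is a minimal PySpark sketch of what writing through the data source API could look like. It assumes the spark-hadoopoffice-ds package is on the classpath (e.g. supplied via --packages) and that the data source name and option name below match the version you use; check the project README for the exact names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("excel-write").getOrCreate()

# Build a DataFrame from the same space-delimited text file.
df = spark.read.text("xyz.txt")

# Each task writes its own partition, so the output can land on HDFS, S3, etc.
df.write \
    .format("org.zuinnote.spark.office.excel") \
    .option("write.locale.bcp47", "en-US") \
    .save("/tmp/output_excel")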

Find some examples here: https://github.com/zuinnote/hadoopoffice
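If the result is small enough to fit in driver memory, a cluster-safe alternative to the code in the question is to run the transformations on the executors and do all the xlsxwriter calls on the driver after collect(). This is only a sketch of that pattern, not part of the hadoopoffice library:

import xlsxwriter
from pyspark import SparkContext

sc = SparkContext(appName="excel-driver-write")

# Transformations run on the executors; collect() brings the results back to the driver.
rows = sc.textFile("xyz.txt").map(lambda line: line.split(" ")[:4]).collect()

# All workbook state stays on the driver, so nothing has to be serialized to the cluster.
workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()
for r, cols in enumerate(rows):
    for c, value in enumerate(cols):
        worksheet.write(r, c, value)
workbook.close()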
