Writing RDD data to an Excel file along with mapping in Apache Spark

Question

Can I write RDD data to an Excel file as part of a map operation in Apache Spark? Is that a correct approach? Wouldn't the write be a local operation that can't be distributed across the cluster?

Below is the Python code. (It is just an example to clarify my question; I understand that this implementation may not actually be needed.)

import xlsxwriter
import sys
import math
from pyspark import SparkContext

# get the spark context in sc.

workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()

data = sc.textFile("xyz.txt")
# xyz.txt is a file whose each line contains string delimited by <SPACE>

row=0

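def mapperFunc(x):
    global row  # required to rebind the module-level counter inside the function
    for i in range(0, 4):
        worksheet.write(row, i, x.split(" ")[i])
    row += 1  # Python has no ++ operator
    return len(x.split())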

data2 = data.map(mapperFunc)

workbook.close()

There are 2 questions:

1. Is using row inside 'mapperFunc' like this correct? Will it be incremented each time?
2. Is writing to the Excel file with worksheet.write() inside the mapper function correct?

Also, if #2 is correct, please clarify this doubt: the worksheet is created on the local machine, so how does that work?

Thanks

Answer 1

The hadoopoffice library lets you write Excel files from Spark 1.x through its ExcelOutputFormat (using PairRdd.saveAsNewAPIHadoopFile), or from Spark 2.x through the data source API. With this library you can store the files on HDFS, locally, on S3, on Azure, and so on.

Find some examples here: https://github.com/zuinnote/hadoopoffice
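
On the two numbered questions themselves: as far as I understand Spark's closure handling, the function passed to map() is serialized and run on the executors against copies of the driver's variables, so increments to row there are not visible on the driver, and a workbook opened on the driver is not something the executors can write into. In addition, map() is lazy, so data.map(mapperFunc) with no action afterwards does not run at all. If the output is small enough to fit in driver memory, one possible alternative (a minimal sketch, not the hadoopoffice approach) is to keep the mapper pure and write the sheet on the driver after collect(). The helper names first_four_fields, write_rows and main are made up for this illustration, and the slice [:4] tolerates short lines where the original indexing would raise an error.

```python
from typing import Iterable, List


def first_four_fields(line: str) -> List[str]:
    """Pure mapper: touches no driver-side state, so it is safe to ship to executors."""
    return line.split(" ")[:4]


def write_rows(worksheet, rows: Iterable[List[str]]) -> int:
    """Driver-side writer for any object exposing write(row, col, value).

    The row index comes from enumerate, not from a shared mutable counter.
    Returns the number of rows written.
    """
    count = 0
    for row_idx, fields in enumerate(rows):
        for col_idx, value in enumerate(fields):
            worksheet.write(row_idx, col_idx, value)
        count = row_idx + 1
    return count


def main() -> None:
    # Third-party imports stay local so the helpers above can be imported
    # on a machine without Spark or XlsxWriter installed.
    import xlsxwriter
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # collect() is an action: it triggers the map and returns the results
    # to the driver. Assumption: the mapped data fits in driver memory.
    rows = sc.textFile("xyz.txt").map(first_four_fields).collect()

    workbook = xlsxwriter.Workbook("output_excel.xlsx")
    worksheet = workbook.add_worksheet()
    write_rows(worksheet, rows)
    workbook.close()
```

Calling main() on the driver would produce output_excel.xlsx on the driver's local disk. For data too large to collect, a distributed writer such as the library mentioned above is the more suitable route.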

Answer 1, posted 2017-04-21 22:35:05
