
Writing RDD data to an Excel file from within a map function in Apache Spark

Can I write RDD data to an Excel file while mapping over it in Apache Spark? Is that a correct approach? Isn't the write a local operation that can't be distributed across the cluster?

Below is the Python code (it's just an example to clarify my question; I understand that this implementation may not actually be required):

import xlsxwriter
import sys
import math
from pyspark import SparkContext

# get the spark context in sc.

workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()

data = sc.textFile("xyz.txt")
# each line of xyz.txt contains strings delimited by a single space

row=0

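def mapperFunc(x):
    global row  # rebinding a module-level name inside a function needs 'global'
    for i in range(0, 4):
        worksheet.write(row, i, x.split(" ")[i])
    row += 1  # Python has no ++ operator
    return len(x.split())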

data2 = data.map(mapperFunc)

workbook.close()

There are two questions:

  1. Is using row inside 'mapperFunc' like this correct? Will row be incremented on each call?
  2. Is writing to the Excel file with worksheet.write() inside the mapper function correct?

Also, if #2 is correct, please clarify one doubt: as I understand it, the worksheet is created on the local machine, so how would that work?

Thanks

The hadoopoffice library lets you write Excel files from Spark 1.x through its ExcelOutputFormat (using PairRdd.saveAsNewAPIHadoopFile), or from Spark 2.x through the data source API. With this library you can store the files on HDFS, locally, on S3, on Azure, and so on.

Find some examples here: https://github.com/zuinnote/hadoopoffice
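For small results, another option is to do the writing on the driver instead of inside the mapper. The code in the question would not produce the expected file: map() is lazy and no action is ever called, and the function passed to map() is serialized and run on the executors, so changes to row (and to any copy of the worksheet) made there are not sent back to the driver. Below is a minimal sketch of the driver-side approach, not a definitive implementation. The helper names to_cells and write_rdd_to_excel are hypothetical; it assumes the XlsxWriter package is installed on the driver and that the data is small enough to stream through it. Unlike the original loop, it truncates short lines instead of raising an IndexError.

```python
from typing import Iterable, Iterator, Tuple

Cell = Tuple[int, int, str]  # (row, column, value)


def to_cells(indexed_lines: Iterable[Tuple[str, int]], ncols: int = 4) -> Iterator[Cell]:
    """Turn (line, row_index) pairs into worksheet cells.

    The pair order matches what RDD.zipWithIndex() yields: (element, index).
    Only the first `ncols` space-delimited fields of each line are kept.
    """
    for line, row in indexed_lines:
        fields = line.split(" ")
        for col, value in enumerate(fields[:ncols]):
            yield row, col, value


def write_rdd_to_excel(rdd, path: str, ncols: int = 4) -> int:
    """Write an RDD of space-delimited lines to an .xlsx file on the driver.

    Hypothetical helper: returns the number of cells written.
    """
    # Imported lazily so that to_cells() works without XlsxWriter installed.
    import xlsxwriter

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    written = 0
    try:
        # zipWithIndex() assigns the row numbers inside Spark, so no shared
        # counter is needed. toLocalIterator() is an action that brings the
        # partitions to the driver one at a time.
        indexed = rdd.zipWithIndex().toLocalIterator()
        for row, col, value in to_cells(indexed, ncols):
            worksheet.write(row, col, value)
            written += 1
    finally:
        workbook.close()  # the file is only written to disk on close()
    return written


# Spark-free check of the cell layout:
cells = list(to_cells([("a b c d e", 0), ("f g", 1)]))
```

With a SparkContext in sc, the call would look like write_rdd_to_excel(sc.textFile("xyz.txt"), "output_excel.xlsx"). If the output is too large to pass through the driver, a distributed writer such as the hadoopoffice library mentioned above is likely the better fit.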
