Writing RDD data to an Excel file during mapping in Apache Spark
Can I write RDD data to an Excel file while mapping in Apache Spark? Is that a correct approach? Won't the writing happen in a driver-local function that cannot be shipped to the cluster?
Below is the Python code (it is just an example to clarify my question; I understand this implementation may not actually be required):
import xlsxwriter
import sys
import math
from pyspark import SparkContext
# Assume the SparkContext is already available as sc.
workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()
data = sc.textFile("xyz.txt")
# xyz.txt is a file whose each line contains string delimited by <SPACE>
row = 0

def mapperFunc(x):
    global row  # mutates driver-local state; this is the crux of the question
    for i in range(0, 4):
        worksheet.write(row, i, x.split(" ")[i])
    row += 1    # 'row++' is not valid Python
    return len(x.split())

data2 = data.map(mapperFunc)
workbook.close()
There are two questions here:
1. Is this a correct way to write the data?
2. Won't mapperFunc run on the executors, where the driver-side worksheet does not exist?
Also, if #2 is correct, then please clarify my doubt: since the worksheet is created on the local (driver) machine, how could this work at all?
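For reference, a minimal sketch of the usual pattern that sidesteps this doubt: keep the per-line transformation pure so it can be safely serialized to executors, collect the (small) result back on the driver, and only then touch xlsxwriter. The helper name `line_to_cells` and the use of `zipWithIndex` are my own illustration, not part of the original question.

```python
def line_to_cells(line, row):
    """Pure transformation: one input line -> list of (row, col, value) cells.
    Safe to ship to executors because it touches no driver-side state."""
    tokens = line.split(" ")[:4]
    return [(row, col, tok) for col, tok in enumerate(tokens)]

# On the driver, assuming sc is an existing SparkContext:
#
# cells = (sc.textFile("xyz.txt")
#            .zipWithIndex()                                  # (line, row_index)
#            .flatMap(lambda p: line_to_cells(p[0], p[1]))
#            .collect())                                      # small result only
#
# import xlsxwriter
# workbook = xlsxwriter.Workbook('output_excel.xlsx')
# worksheet = workbook.add_worksheet()
# for r, c, v in cells:
#     worksheet.write(r, c, v)                                # driver-side only
# workbook.close()
```

The key design choice: the distributed part produces plain data (tuples), and the non-serializable, file-handle-holding xlsxwriter objects never leave the driver.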
Thanks
The hadoopoffice library enables you to write Excel files with Spark 1.x, via integration of the ExcelOutputFormat (using PairRdd.saveAsNewAPIHadoopFile), or with the Spark 2.x data source API. Using this library you can store the files on HDFS, locally, on S3, on Azure, etc.
You can find examples here: https://github.com/zuinnote/hadoopoffice
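A hedged sketch of what the Spark 2.x data source route looks like from PySpark. The format string and the option name below are assumptions drawn from the hadoopoffice project's documentation; verify them against the library version and jars you actually deploy.

```python
# Assumed identifiers for the hadoopoffice Spark 2.x data source;
# check these against the project's README for your version.
EXCEL_FORMAT = "org.zuinnote.spark.office.excel"

def excel_writer_options(locale="us"):
    """Options for DataFrameWriter.option(); the key name is an assumption."""
    return {"write.locale.bcp47": locale}

# On a cluster with the hadoopoffice jars on the classpath, assuming an
# existing SparkSession `spark` and DataFrame `df`:
#
# (df.write
#    .format(EXCEL_FORMAT)
#    .options(**excel_writer_options())
#    .save("hdfs:///user/me/output_excel"))
```

Unlike the driver-side xlsxwriter approach, this writes the Excel output in parallel through Hadoop's output machinery, so it also works for data too large to collect.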