Spark MLlib unable to write out to S3: path already exists
I have data in an S3 bucket under the directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The converted lines should go to S3 under the directory /data/spark.
Basically, repeat each string the number of times given after the colon. I am trying to convert a VW (Vowpal Wabbit) LDA input file into the corresponding format consumed by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}

object Vw2SparkLdaFormatConverter {

  // Turns "token:count" into the token repeated `count` times, space-separated.
  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]) {
    val inputPath = args(0)
    val outputPath = args(1)
    val conf = new SparkConf().setAppName("FormatConverter")
    val sc = new SparkContext(conf)

    val vwdata = sc.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)
    val coalescedSparkData = sparkdata.coalesce(100)
    coalescedSparkData.saveAsTextFile(outputPath)

    sc.stop()
  }
}
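For example, repeater("abc:2") returns "abc abc " (with a trailing space), so a line such as "abc:2 def:1 ghi:3" maps to "abc abc def ghi ghi ghi " after the map and mkString steps.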
When I run this (as a Spark EMR job in AWS), the step fails with the exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1 etc.), ensuring that the path does not exist before the step runs. Even then it does not work.
What am I doing wrong? I am new to Scala and Spark, so I might be overlooking something here.
You could convert to a DataFrame and then save with overwrite mode enabled:
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
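Note that toDF on an RDD requires a SparkSession's implicits in scope, and Scala string literals use double quotes (single quotes delimit Char literals). A minimal sketch of this approach, assuming the coalescedSparkData RDD from the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("FormatConverter").getOrCreate()
    import spark.implicits._ // enables .toDF on RDDs of common types such as String

    // "overwrite" replaces any existing output at the path instead of throwing
    // FileAlreadyExistsException the way saveAsTextFile does.
    coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)

Since each row is a single string, write.mode("overwrite").text(outputPath) would also work and keeps the output as plain text, like saveAsTextFile.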
Or, if you insist on using RDD methods, you can do as described already in this answer.
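In case the link rots: a common RDD-level workaround (whether it is the one the linked answer describes is an assumption) is to delete the existing output directory through the Hadoop FileSystem API before saving:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Recursively delete the old output if present, then save as before.
    val fs = FileSystem.get(URI.create(outputPath), sc.hadoopConfiguration)
    fs.delete(new Path(outputPath), true) // returns false if the path did not exist
    coalescedSparkData.saveAsTextFile(outputPath)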