
How to pass a text file as an argument in Scala

I want to write my word-count program so that I can pass the input text file as an argument to main. I am very new to Scala, so I don't know the specifics of how to pass it. I tried declaring it directly in my main function as def main(args: "C:/Users/rsjadsa/Documents/input.txt").

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object WordC {
 def main(args: String, args1 : String){
 val cf = new SparkConf().setAppName("WordCount").setMaster("local")
 val sc = new SparkContext(cf)
 val words = args.flatMap(line => line.split(" "))
 val wordCount = words.map(word => (word, 1)).reduceByKey(_ + _)
 wordCount.foreach(println)

 }
}

I just want to pass my text file as an argument instead of hardcoding it, and apply the same word-count program to it. I know I am new to this language, so sorry for the silly question.

It would be the first element of the Array of Strings args, but what you do with it depends on how you run the program. This just reads the command-line argument and assigns it to a variable. You also need to define a schema if you want the data in a DataFrame (which you probably should).

EDIT: Since you want to do the word count with an RDD, I took out the DataFrame stuff because it was confusing. Also, you should collect the RDD to the driver before printing to the screen, or it may behave unexpectedly because the data is still on the executors.

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf


object WordC {
  def main(args: Array[String]): Unit = {

    // retrieve the filename 
    val filename = args(0)

    val cf = new SparkConf().setAppName("WordCount").setMaster("local")
    val sc = new SparkContext(cf)

    val inputRDD = sc.textFile(filename)

    val wordsRDD = inputRDD.flatMap(line => line.split(" "))
    val wordCountRDD = wordsRDD.map(word => (word, 1)).reduceByKey(_ + _)
    wordCountRDD.collect.foreach(println(_))

  }
}

And then, however you run the program, the command-line argument would just be C:/Users/rsjadsa/Documents/input.txt, e.g. scala WordC.scala "C:/Users/rsjadsa/Documents/input.txt"
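
For the schema/DataFrame point mentioned earlier, here is a minimal sketch of how the word counts could be put into a DataFrame. It assumes a Spark 1.x-style SQLContext and reuses sc and wordCountRDD from the code above; the column names word and count are just illustrative:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row

// assumes `sc` and `wordCountRDD` from the word-count code above
val sqlContext = new SQLContext(sc)

// define an explicit schema for the (word, count) pairs
val schema = StructType(Seq(
  StructField("word", StringType, nullable = false),
  StructField("count", IntegerType, nullable = false)
))

// convert each (word, count) tuple into a Row and build the DataFrame
val rowRDD = wordCountRDD.map { case (word, count) => Row(word, count) }
val wordCountDF = sqlContext.createDataFrame(rowRDD, schema)

wordCountDF.show()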
