
How to read input from a file and convert data lines of the file to List[Map[Int,String]] using Scala?

My goal is to read input from a file and convert the data lines of the file to a List[Map[Int,String]] using Scala. At the moment I give a hard-coded dataset as the input. My code is:

  def id3(attrs: Attributes,
      examples: List[Example],
      label: Symbol
       ) : Node = {
level = level+1


  // if all the examples have the same label, return a new node with that label

  if(examples.forall( x => x(label) == examples(0)(label))){
  new Leaf(examples(0)(label))
  } else {
  for(a <- attrs.keySet-label){          //except label, take all attrs
    println("Information gain for %s is %f".format(a,
      informationGain(a,attrs,examples,label)))
  }


  // find the best splitting attribute - this is an argmax on a function over the list

  var bestAttr:Symbol = argmax(attrs.keySet-label, (x:Symbol) =>
    informationGain(x,attrs,examples,label))




  // now we produce a new branch, which splits on that node, and recurse down the nodes.

  var branch = new Branch(bestAttr)

  for(v <- attrs(bestAttr)){


    val subset = examples.filter(x=> x(bestAttr)==v)



    if(subset.size == 0){
      // println(levstr+"Tiny subset!")
      // zero subset, we replace with a leaf labelled with the most common label in
      // the examples
      val m = examples.map(_(label))
      val mostCommonLabel = m.toSet.map((x:Symbol) => (x,m.count(_==x))).maxBy(_._2)._1
      branch.add(v,new Leaf(mostCommonLabel))

    }
    else {
      // println(levstr+"Branch on %s=%s!".format(bestAttr,v))

      branch.add(v,id3(attrs,subset,label))
    }
   }
  level = level-1
  branch
  }
  }
  }
object samplet {
def main(args: Array[String]){

var attrs: sample.Attributes = Map()
attrs += ('0 -> Set('abc,'nbv,'zxc))
attrs += ('1 -> Set('def,'ftr,'tyh))
attrs += ('2 -> Set('ghi,'azxc))
attrs += ('3 -> Set('jkl,'fds))
attrs += ('4 -> Set('mno,'nbh))



val examples: List[sample.Example] = List(
  Map(
    '0 -> 'abc,
    '1 -> 'def,
    '2 -> 'ghi,
    '3 -> 'jkl,
    '4 -> 'mno
  ),
  ........................
  )


// obviously we can't use the label as an attribute, that would be silly!
val label = 'play

println(sample.id3(attrs,examples,label).getStr(0))

}
}

But how do I change this code to accept input from a .csv file?

I suggest you use Java's io/nio standard library to read your CSV file. I see no relevant drawback in doing so.

But the first question we need to answer is where in the code to read the file. The parsed input seems to replace the value of examples. This also hints at the type the parsed CSV input must have, namely List[Map[Symbol, Symbol]]. So let us declare a new class:

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  def getInput(file: Path): List[Map[Symbol, Symbol]] = ???
}

Note that the Charset is only needed if we must distinguish between differently encoded CSV files.

Okay, so how do we implement the method? It should do the following:

  1. Create an appropriate input reader
  2. Read all lines
  3. Split each line at the comma separator
  4. Transform each substring into the symbol it represents
  5. Build a map from the list of symbols, using the attributes as keys
  6. Create and return the list of maps

Or expressed in code:

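import java.io.BufferedReader
import java.nio.charset.Charset
import java.nio.file.{Files, Path}
import scala.collection.JavaConversions

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {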
  val Attributes = List('outlook, 'temperature, 'humidity, 'wind, 'play)
  val Separator = ","

  /** Get the desired input from the CSV file. Does not perform any checks, i.e., there are no guarantees on what happens if the input is malformed. */
  def getInput(file: Path): List[Map[Symbol, Symbol]] = {
    val reader = Files.newBufferedReader(file, charset)
    /* Read the whole file and discard the first line */
    inputWithHeader(reader).tail
  }

  /** Reads all lines in the CSV file using [[java.io.BufferedReader]]. There are many ways to do this and this is probably not the prettiest. */
  private def inputWithHeader(reader: BufferedReader): List[Map[Symbol, Symbol]] = {
    (JavaConversions.asScalaIterator(reader.lines().iterator()) foldLeft List.empty[Map[Symbol, Symbol]]){
      (accumulator, nextLine) =>
        parseLine(nextLine) :: accumulator
    }.reverse
  }

  /** Parse an entry. Does not verify the input: if there are fewer attributes than columns or vice versa, zip creates a list with the size of the shorter list. */
  private def parseLine(line: String): Map[Symbol, Symbol] = (Attributes zip (line split Separator map parseSymbol)).toMap

  /** Create a symbol from a String... we could also check whether the string represents a valid symbol */
  private def parseSymbol(symbolAsString: String): Symbol = Symbol(symbolAsString)
}

Caveat: Since we expect only valid input, we assume that the individual symbol representations do not contain the separator character (the comma). If this cannot be assumed, the code as is would fail to split certain valid input strings correctly.
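If fields may contain commas, one common convention is to wrap such fields in double quotes. The following is only a minimal sketch of a splitter that honours that convention; the object name QuotedCsv and the helper splitCsvLine are hypothetical and not part of the code above, and a dedicated CSV library is probably the better choice for real input.

```scala
object QuotedCsv {
  /** Split one CSV line on commas, treating text inside double quotes as part
    * of a single field. Assumption: a doubled quote ("") inside a quoted field
    * stands for one literal quote character. */
  def splitCsvLine(line: String): List[String] = {
    val fields = scala.collection.mutable.ListBuffer.empty[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          current.append('"') // escaped quote inside a quoted field
          i += 1              // skip the second quote of the pair
        } else if (c == '"') {
          inQuotes = false    // closing quote
        } else {
          current.append(c)   // commas are kept verbatim while quoted
        }
      } else if (c == '"') {
        inQuotes = true       // opening quote
      } else if (c == ',') {
        fields += current.toString // field boundary
        current.clear()
      } else {
        current.append(c)
      }
      i += 1
    }
    fields += current.toString // last field has no trailing comma
    fields.toList
  }
}
```

In parseLine, `line split Separator` could then be swapped for `QuotedCsv.splitCsvLine(line)`. Note that, unlike String.split, this sketch keeps trailing empty fields.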

To use this new code, we could change the main method as follows:

def main(args: Array[String]){
  val csvInputFile: Option[Path] = args.headOption map (p => Paths get p)
  val examples = (csvInputFile map new InputFromCsvLoader().getInput).getOrElse(exampleInput)
  // ... your code

Here, exampleInput is the current, hardcoded value of examples; it is used as a fallback if no input argument is specified.

Important: In the code, all error handling has been omitted for convenience. Errors can occur whenever you read from files, and user input must always be treated as potentially invalid, so sadly, error handling at the boundaries of your program is usually not optional.
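As a rough illustration of what boundary error handling could look like, here is a minimal sketch that wraps the file access in scala.util.Try and always closes the file. The names SafeCsv and readLines are made up for this example, and UTF-8 is an assumed encoding; the sketch uses scala.io.Source rather than the BufferedReader from the loader above.

```scala
import scala.io.Source
import scala.util.{Failure, Success, Try}

object SafeCsv {
  /** Read all lines of a file. Any exception (missing file, bad encoding, ...)
    * is captured in the returned Try; the Source is closed in every case. */
  def readLines(path: String): Try[List[String]] = Try {
    val source = Source.fromFile(path, "UTF-8") // assumed encoding
    try source.getLines().toList
    finally source.close()
  }

  def main(args: Array[String]): Unit =
    args.headOption match {
      case None =>
        println("usage: SafeCsv <file.csv>")
      case Some(path) =>
        readLines(path) match {
          case Success(lines) => println(s"read ${lines.size} lines from $path")
          case Failure(e)     => println(s"could not read $path: ${e.getMessage}")
        }
    }
}
```

The caller decides what to do with a Failure, for example falling back to the hardcoded examples or exiting with a message.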

Side notes:

  • Try not to use null in your code. Returning Option[T] is a better option than returning null, because it makes "nullness" explicit and provides static safety thanks to the type system.
  • The return keyword is not required in Scala, as the last value of a method is always returned. You can still use the keyword if you find the code more readable or if you want to break out in the middle of your method (which is usually a bad idea).
  • Prefer val over var, because immutable values are much easier to understand than mutable values.
  • The code will fail with the provided CSV string, because it contains the symbols TRUE and FALSE, which are not legal according to your program's logic (they should be true and false instead).
  • Add all information to your error messages. Your error message only tells me that a value for the attribute 'wind is bad, but it does not tell me what the actual value is.

Read the CSV file:

import scala.io.Source

val datalines = Source.fromFile(filepath).getLines()

So datalines is an iterator over all the lines of the CSV file.

Next, convert each line into a Map[Int,String]:

val datamap = datalines.map { line =>
  line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
}

Here, we split each line on ",". Then we construct a map with the column number as the key and the corresponding word from the split as the value.

Next, if we want a List[Map[Int,String]]:

val datamap = datalines.map { line =>
  line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
}.toList
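Putting the steps together, a self-contained version might look like the sketch below. The names CsvToMaps, toMaps and readCsv are chosen for illustration only. Two details differ from the snippets above and are assumptions on my part: the file is closed after reading, and split is called with a limit of -1 so that trailing empty columns are kept.

```scala
import scala.io.Source

object CsvToMaps {
  /** Turn lines into one Map per line, keyed by the zero-based column index. */
  def toMaps(lines: Iterator[String]): List[Map[Int, String]] =
    lines.map { line =>
      // limit -1 keeps trailing empty columns, which plain split(",") drops
      line.split(",", -1).zipWithIndex.map { case (word, idx) => idx -> word }.toMap
    }.toList

  /** Read a CSV file into List[Map[Int, String]] and close the file afterwards. */
  def readCsv(filepath: String): List[Map[Int, String]] = {
    val source = Source.fromFile(filepath)
    try toMaps(source.getLines()) // toList inside toMaps forces the read before close
    finally source.close()
  }
}
```

Usage would be `val datamap = CsvToMaps.readCsv(filepath)`. Because the iterator is converted to a List before the file is closed, the result can be traversed as often as needed.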
