How to read input from a file and convert the data lines of the file to List[Map[Int,String]] using Scala?
My question is how to read input from a file and convert the data lines of the file to List[Map[Int,String]] using Scala. Here I give a dataset as the input. My code is:
def id3(attrs: Attributes,
        examples: List[Example],
        label: Symbol
       ) : Node = {
  level = level+1
  // if all the examples have the same label, return a new node with that label
  if(examples.forall( x => x(label) == examples(0)(label))){
    new Leaf(examples(0)(label))
  } else {
    for(a <- attrs.keySet-label){ //except label, take all attrs
      ("Information gain for %s is %f".format(a,
        informationGain(a,attrs,examples,label)))
    }
    // find the best splitting attribute - this is an argmax on a function over the list
    var bestAttr:Symbol = argmax(attrs.keySet-label, (x:Symbol) =>
      informationGain(x,attrs,examples,label))
    // now we produce a new branch, which splits on that node, and recurse down the nodes.
    var branch = new Branch(bestAttr)
    for(v <- attrs(bestAttr)){
      val subset = examples.filter(x=> x(bestAttr)==v)
      if(subset.size == 0){
        // println(levstr+"Tiny subset!")
        // zero subset, we replace with a leaf labelled with the most common label in
        // the examples
        val m = examples.map(_(label))
        val mostCommonLabel = m.toSet.map((x:Symbol) => (x,m.count(_==x))).maxBy(_._2)._1
        branch.add(v,new Leaf(mostCommonLabel))
      }
      else {
        // println(levstr+"Branch on %s=%s!".format(bestAttr,v))
        branch.add(v,id3(attrs,subset,label))
      }
    }
    level = level-1
    branch
  }
}
}
object samplet {
  def main(args: Array[String]): Unit = {
    var attrs: sample.Attributes = Map()
    attrs += ('0 -> Set('abc,'nbv,'zxc))
    attrs += ('1 -> Set('def,'ftr,'tyh))
    attrs += ('2 -> Set('ghi,'azxc))
    attrs += ('3 -> Set('jkl,'fds))
    attrs += ('4 -> Set('mno,'nbh))
    val examples: List[sample.Example] = List(
      Map(
        '0 -> 'abc,
        '1 -> 'def,
        '2 -> 'ghi,
        '3 -> 'jkl,
        '4 -> 'mno
      ),
      ........................
    )
    // obviously we can't use the label as an attribute, that would be silly!
    val label = 'play
    println(sample.id3(attrs,examples,label).getStr(0))
  }
}
But how do I change this code to accept input from a .csv file?
I suggest you use Java's io / nio standard library to read your CSV file. I see no relevant drawback in doing so.
But the first question we need to answer is where in the code the file should be read. The parsed input seems to replace the value of examples. This also tells us what type the parsed CSV input must have, namely List[Map[Symbol, Symbol]]. So let us declare a new class:
import java.nio.charset.Charset
import java.nio.file.Path

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  def getInput(file: Path): List[Map[Symbol, Symbol]] = ???
}
Note that the Charset is only needed if we must distinguish between differently encoded CSV files.
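Okay, so how do we implement the method? It should do the following:
1. Open a reader on the file.
2. Read each line.
3. Split the line at the separator.
4. Turn each substring into the symbol it represents.
5. Build a map from the list of symbols, using the attributes as keys.
6. Discard the header line and return the list of maps.
Or expressed in code: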
import java.io.BufferedReader
import java.nio.charset.Charset
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
  val Attributes = List('outlook, 'temperature, 'humidity, 'wind, 'play)
  val Separator = ","

  /** Get the desired input from the CSV file. Does not perform any checks,
    * i.e., there are no guarantees on what happens if the input is malformed. */
  def getInput(file: Path): List[Map[Symbol, Symbol]] = {
    val reader = Files.newBufferedReader(file, charset)
    /* Read the whole file, discard the first line, and always close the reader */
    try inputWithHeader(reader).tail
    finally reader.close()
  }

  /** Reads all lines in the CSV file using [[java.io.BufferedReader]]. */
  private def inputWithHeader(reader: BufferedReader): List[Map[Symbol, Symbol]] =
    reader.lines().iterator().asScala.map(parseLine).toList

  /** Parse an entry. Does not verify the input: if there are fewer attributes
    * than columns or vice versa, zip creates a list of the size of the shorter list. */
  private def parseLine(line: String): Map[Symbol, Symbol] =
    Attributes.zip(line.split(Separator).map(parseSymbol)).toMap

  /** Create a symbol from a String. We could also check whether the string represents a valid symbol. */
  private def parseSymbol(symbolAsString: String): Symbol = Symbol(symbolAsString)
}
Caveat: Since we expect only valid input, we assume that the individual symbol representations do not contain the comma separator. If this cannot be assumed, the code as is would fail to split certain valid input strings correctly.
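If values may contain the separator, one possible workaround is to split only at commas outside double quotes. The following is a minimal sketch under that assumption; QuotedCsv and splitLine are hypothetical names that are not part of the code above, and the sketch ignores escaped quotes ("") and fields spanning several lines. For real-world CSV, a dedicated CSV library is likely the safer choice.

```scala
object QuotedCsv {
  /** Splits one CSV line at separators that are outside double quotes.
    * Hypothetical helper: no support for escaped quotes or multi-line fields. */
  def splitLine(line: String, separator: Char = ','): List[String] = {
    val fields = List.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    for (c <- line) {
      if (c == '"') inQuotes = !inQuotes        // toggle quoted state, drop the quote itself
      else if (c == separator && !inQuotes) {   // field boundary
        fields += current.toString
        current.clear()
      } else current.append(c)                  // ordinary character, or separator inside quotes
    }
    fields += current.toString                  // the last field has no trailing separator
    fields.result()
  }
}
```

In parseLine, `line split Separator` could then be replaced by `QuotedCsv.splitLine(line)`.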
To use this new code, we could change the main method as follows:
import java.nio.file.{Path, Paths}

def main(args: Array[String]): Unit = {
  val csvInputFile: Option[Path] = args.headOption.map(p => Paths.get(p))
  val examples = csvInputFile.map(new InputFromCsvLoader().getInput).getOrElse(exampleInput)
  // ... your code
}
Here, if no input argument is specified, examples falls back to exampleInput, which stands for the current, hardcoded value of examples.
Important: In the code, all error handling has been omitted for convenience. Reading from files can fail, and user input must always be treated as potentially invalid, so sadly, error handling at the boundaries of your program is usually not optional.
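As one illustration of such boundary checks, a line parser could report malformed lines instead of silently truncating them. This is only a sketch: CheckedParsing and parseLineChecked are hypothetical names, the attribute list is copied from the loader above, and the specific checks (column count, empty values) are assumptions about what counts as invalid.

```scala
object CheckedParsing {
  // Attribute names copied from the loader above.
  val Attributes: List[Symbol] =
    List("outlook", "temperature", "humidity", "wind", "play").map(Symbol(_))

  /** Returns Right(entry) for a well-formed line, or Left(message) that
    * names the line number and shows the offending content. */
  def parseLineChecked(line: String, lineNumber: Int): Either[String, Map[Symbol, Symbol]] = {
    // limit -1 keeps trailing empty columns so that they can be detected
    val columns = line.split(",", -1).map(_.trim).toList
    if (columns.size != Attributes.size)
      Left(s"line $lineNumber: expected ${Attributes.size} columns but found ${columns.size}: '$line'")
    else if (columns.exists(_.isEmpty))
      Left(s"line $lineNumber: empty value in '$line'")
    else
      Right(Attributes.zip(columns.map(Symbol(_))).toMap)
  }
}
```

The caller can then decide whether to skip a bad line, abort, or collect all messages and report them together.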
Side notes:
- Avoid null in your code. Returning Option[T] is a better option than returning null, because it makes "nullness" explicit and provides static safety thanks to the type system.
- The return keyword is not required in Scala, as the last value of a method is always returned. You can still use the keyword if you find the code more readable or if you want to break out in the middle of your method (which is usually a bad idea).
- Prefer val over var, because immutable values are much easier to understand than mutable values.
- The input contains the values TRUE and FALSE, which are not legal according to your program's logic (they should be true and false instead).
- The error message says that the value of the attribute 'wind is bad, but it does not tell me what the actual value is.

Read a CSV file:
import scala.io.Source
val datalines = Source.fromFile(filepath).getLines()
Now datalines contains all the lines from the CSV file.
Next, convert each line into a Map[Int,String]:
val datamap = datalines.map{ line =>
  line.split(",").zipWithIndex.map{ case (word, idx) => idx -> word}.toMap
}
Here, we split each line at ",". Then we construct a map with the column number as key and the corresponding word after the split as value.
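To make the shape of one such map concrete, here is the same transformation applied to a single line. The sample line and the names LineToMapExample, line, and row are made up for illustration.

```scala
object LineToMapExample {
  // A made-up data line with three columns.
  val line = "sunny,hot,high"

  // Split at commas, pair each word with its zero-based position, then use the position as key.
  val row: Map[Int, String] =
    line.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap
  // row == Map(0 -> "sunny", 1 -> "hot", 2 -> "high")
}
```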
Next, if we want a List[Map[Int,String]]:
val datamap = datalines.map{ line =>
  line.split(",").zipWithIndex.map{ case (word, idx) => idx -> word}.toMap
}.toList
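Two details are easy to miss here: getLines() returns an Iterator, which can be traversed only once, and the underlying Source keeps the file open until it is closed. A possible way to package the steps, sketched with the hypothetical names CsvToMaps and readRows, is to build the list eagerly and close the source in a finally block:

```scala
import scala.io.Source

object CsvToMaps {
  /** Reads every line of the file at `path` into a List[Map[Int, String]],
    * keyed by zero-based column index. No header handling, no validation. */
  def readRows(path: String): List[Map[Int, String]] = {
    val source = Source.fromFile(path)
    try {
      source.getLines()
        .map(_.split(",").zipWithIndex.map { case (word, idx) => idx -> word }.toMap)
        .toList // force evaluation before the source is closed
    } finally source.close()
  }
}
```

If the file has a header line, calling `.drop(1)` on the iterator before the map step would skip it.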