Create a DataFrame from a CSV (URL source) with Scala

Question

I have a comma-separated CSV file stored online (https://xxx.com/xx/xx.csv). I can fetch it like this:

import scala.io.Source

val stringCsv = Source.fromURL(url,"UTF-8").mkString

Now I want to convert stringCsv into a Spark DataFrame without the header.

Answer 1

I am guessing that Source is scala.io.Source, which returns an iterator. You could get a line iterator out of it, skip the first line, and then turn it into a DataFrame.

This would work like this:

// a Source iterates over characters, so ask for lines explicitly
val lines = Source.fromURL(url, "UTF-8").getLines()

// skip the header
lines.next()

// convert to DF
import spark.implicits._
val df = lines.toList.toDF

// here you end up with a DataFrame of strings (each row has a single column).
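
If separate fields are needed rather than one string per row, the lines can be split before they reach Spark. This is only a minimal sketch: it assumes a simple CSV with no quoted commas, and the sample string is made up to stand in for the downloaded text.

```scala
import scala.io.Source

// Made-up sample standing in for the text fetched from the URL
val sample = "city,population\nParis,2100000\nRome,2800000\n"

// Iterate over lines instead of characters
val lines = Source.fromString(sample).getLines()

// Drop the header line
lines.next()

// Naive split on commas; the -1 limit keeps trailing empty fields
val rows: List[List[String]] = lines.map(_.split(",", -1).toList).toList
```

With spark.implicits._ in scope, mapping each row to a tuple and calling toDF with column names should give a multi-column DataFrame. For input with quoted fields, spark.read.csv also accepts a Dataset[String] (Spark 2.2 and later), which leaves the parsing to Spark's CSV reader.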

But that would be quite inefficient for bigger files, since everything is collected on the driver first. The Spark way would be:

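import org.apache.spark.SparkFiles

// download the file once; it becomes available through SparkFiles
spark.sparkContext.addFile(url)
val df = spark.read.format("csv")
  .option("sep", ",")            // the file in the question is comma-separated
  .option("inferSchema", "true")
  .option("header", "true")      // treat the first line as column names, not data
  .load("file://" + SparkFiles.get("yourfile.csv"))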

There you have the option to define whether your input has a header or not (among many other options). The catch is that spark.sparkContext.addFile(url) registers your file under its file name and not the full path, so a URL like https://raw.githubusercontent.com/IBM/knative-serverless/master/src/destination/cities.csv would be registered under cities.csv.
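
The file-name rule can be sketched in plain Scala. Note that fileNameOf is a hypothetical helper, not part of Spark, and it assumes the URL path ends in the file name with no trailing slash.

```scala
import java.net.URI

// Hypothetical helper: returns the last path segment of a URL
def fileNameOf(url: String): String = {
  val path = new URI(url).getPath
  path.substring(path.lastIndexOf('/') + 1)
}

val url = "https://raw.githubusercontent.com/IBM/knative-serverless/master/src/destination/cities.csv"
val name = fileNameOf(url)

// With a SparkSession named `spark` in scope (assumed), the name plugs into SparkFiles.get:
//   spark.sparkContext.addFile(url)
//   val df = spark.read.option("header", "true").csv("file://" + SparkFiles.get(name))
```

Deriving the name from the URL avoids hard-coding "yourfile.csv" in the load call.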
