I have a CSV string held in an RDD, and I need to convert it into a Spark DataFrame.
I will explain the problem from the beginning.
I have this directory structure.
Csv_files (dir)
|- A.csv
|- B.csv
|- C.csv
All I have access to is Csv_files.zip, which is stored in HDFS.
I could have read the files directly if each one were stored as A.gz, B.gz ... But what I have is a compressed directory containing the files.
With the help of an answer on SO (How to open/stream .zip files through Spark?), I was able to convert this zip file into a dictionary.
d = {
'A.csv':'A,B,C\n1,2,3\n4,5,6 ...',
'B.csv':'A,B,C\n7,8,9\n1,2,3 ...'
}
Now I need to convert this csv_string 'A,B,C\n1,2,3\n4,5,6 ...'
to a DataFrame. I tried this,
How can I efficiently convert csv_string to a meaningful DataFrame?
My Spark version is 1.6.2 and my Python version is 2.6.6.
You first have to split the CSV strings in your dict into records according to CSV-compliant rules. For the example here, I will only split on newlines, but you should pay attention to newlines inside quoted values (Spark 2.2 supports multi-line CSV records).
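To illustrate the caveat about newlines inside values, here is a minimal Python sketch that compares a plain newline split with the standard csv module. The helper name parse_csv_string and the sample string are made up for illustration. The sketch is written against Python 3; on the Python 2.6 mentioned in the question, the StringIO handling would likely need adjusting.

```python
import csv
import io


def parse_csv_string(csv_string):
    """Parse one CSV string into (header, rows) using the csv module.

    Hypothetical helper: quoted fields may contain newlines and commas.
    """
    # newline="" keeps embedded line breaks untouched for the csv reader
    reader = csv.reader(io.StringIO(csv_string, newline=""))
    records = [tuple(record) for record in reader if record]  # skip blank lines
    return records[0], records[1:]


# Illustrative sample: the second field of the first data row spans two lines
sample = 'A,B,C\n1,"two\nlines",3\n4,5,6\n'

# A plain split cuts the quoted value in half and leaves a trailing empty string
naive_lines = sample.split("\n")

# The csv module keeps the quoted value in a single field
header, rows = parse_csv_string(sample)
```

With this sample, the plain split yields five fragments, while the csv-based parse yields one header and two data rows.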
(scala code)
// original data as shown in the example
val d: Map[String, RDD[String]] = ...
// flatmap lines
val newRDDs: List[RDD[String]] = d.values.map { curRDD =>
  // Split each CSV string into lines and drop its header
  curRDD.flatMap(_.split('\n').drop(1))
}.toList // Map.values.map yields an Iterable, so convert it to a List
// Beware, this can be extremely costly if you have too many rdds.
val unionAll: RDD[String] = sc.union(newRDDs)
// Finally, create df from rows.
// In spark 2.2, you would do something like spark.read.csv(spark.createDataset(unionAll))
// In spark < 2.x, you need to parse manually to classes (or Row) and then sqlContext.createDataFrame(parsedRows)
NB: The code above has not been compiled or tested and is here only to illustrate the idea.
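Since the question uses Python and Spark 1.6, the same idea can be sketched in Python. This is a rough sketch under several assumptions, not a tested solution:

- The dictionary d from the question is small enough to be processed on the driver.
- Every file shares the same header.
- sql_context is an existing SQLContext.

The function names csv_strings_to_rows and to_dataframe are hypothetical. The code targets Python 3, so the StringIO usage may need adapting for Python 2.6.

```python
import csv
import io


def csv_strings_to_rows(d):
    """Merge a {filename: csv_string} dict into (header, rows).

    Assumes all files share one header; raises ValueError otherwise.
    """
    header = None
    rows = []
    for name in sorted(d):  # sorted only to make the row order deterministic
        reader = csv.reader(io.StringIO(d[name], newline=""))
        records = [tuple(record) for record in reader if record]
        if not records:
            continue  # empty file: nothing to add
        if header is None:
            header = records[0]
        elif records[0] != header:
            raise ValueError("header mismatch in %s" % name)
        rows.extend(records[1:])  # drop the per-file header
    return header, rows


def to_dataframe(sql_context, d):
    """Build a DataFrame from the merged rows.

    sql_context is assumed to be a pyspark.sql.SQLContext; every column
    will be a string column because nothing is cast here.
    """
    header, rows = csv_strings_to_rows(d)
    # createDataFrame accepts a list of tuples plus a list of column names
    return sql_context.createDataFrame(rows, list(header))
```

If the data does not fit on the driver, the rows would instead need to be parsed inside an RDD transformation, as in the Scala outline above. Columns can be cast to numeric types afterwards if needed.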