Spark: convert a CSV to RDD[Row]
I have a .csv file which contains 258 columns in the following structure.
["label", "index_1", "index_2", ... , "index_257"]
Now I want to transform this .csv file to an RDD[Row]:
val data_csv = sc.textFile("~/test.csv")
val rowRDD = data_csv.map(_.split(",")).map(p => Row( p(0), p(1).trim, p(2).trim))
If I do the transform this way, I have to write out all 258 columns explicitly. So I tried:
val rowRDD = data_csv.map(_.split(",")).map(p => Row( _ => p(_).trim))
and
val rowRDD = data_csv.map(_.split(",")).map(p => Row( x => p(x).trim))
But neither of these works; both report the error:
error: missing parameter type for expanded function ((x$2) => p(x$2).trim)
Can anyone tell me how to do this transform? Thanks a lot.
You should use sqlContext instead of sparkContext, as:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("~/test.csv")
This will create a dataframe; calling .rdd on df should give you an RDD[Row]:
val rdd = df.rdd
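As a minimal sketch of what you can then do with the result (since header is "true" and no schema is inferred here, every column is read as a string, so getString is safe; the variable names are assumptions):

```scala
import org.apache.spark.sql.Row

val rdd = df.rdd                              // RDD[Row]
// Pull out the first CSV column ("label") from each Row by index.
val labels = rdd.map((row: Row) => row.getString(0))
```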
Rather than reading it as a textFile, read CSV files with spark-csv. In your case:
val df = sqlContext.read
.format("com.databricks.spark.csv")
  .option("header", "true") // use the first line of each file as the header
  .option("inferSchema", "true") // automatically infer data types
  .option("quote", "\"") // set the quote character
  .option("ignoreLeadingWhiteSpace", "true") // trim whitespace before each value
.load("cars.csv")
This loads the data as a dataframe, and now you can easily change it to an RDD. Hope this helps!
Apart from the other answers, which are correct, the right way to do what you're trying to do is to use Row.fromSeq inside the map function.
val rdd = sc.parallelize(Array((1 to 258).toArray, (1 to 258).toArray) )
.map(Row.fromSeq(_))
This will turn your rdd into one of type Row:
Array[org.apache.spark.sql.Row] = Array([1,2,3,4,5,6,7,8,9,10...
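Applied to the textFile pipeline from the question, a sketch (using the question's path and variable names) would be:

```scala
import org.apache.spark.sql.Row

val data_csv = sc.textFile("~/test.csv")
// Build each Row from the whole split array at once --
// no need to list all 258 columns individually.
val rowRDD = data_csv
  .map(_.split(","))
  .map(fields => Row.fromSeq(fields.map(_.trim)))
```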