
Spark: convert a CSV to RDD[Row]

I have a .csv file which contains 258 columns, in the following structure:

["label", "index_1", "index_2", ... , "index_257"]

Now I want to transform this .csv file into an RDD[Row]:

val data_csv = sc.textFile("~/test.csv")

val rowRDD = data_csv.map(_.split(",")).map(p => Row( p(0), p(1).trim, p(2).trim)) 

If I do the transform this way, I have to write out all 258 columns explicitly. So I tried:

val rowRDD = data_csv.map(_.split(",")).map(p => Row( _ => p(_).trim)) 

and

val rowRDD = data_csv.map(_.split(",")).map(p => Row( x => p(x).trim))

But neither of these works; both report the error:

error: missing parameter type for expanded function ((x$2) => p(x$2).trim)

Can anyone tell me how to do this transform? Thanks a lot.

You should use sqlContext instead of sparkContext, as in:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("~/test.csv")

This will create a DataFrame. Calling .rdd on the df should give you RDD[Row]:

val rdd = df.rdd
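
If you still need each string field trimmed, as in the original map, you can do that on the resulting RDD. This is only a sketch, assuming the columns were read as plain strings (no inferSchema):

    import org.apache.spark.sql.Row

    // Trim every String field of each Row and rebuild the Row from the sequence.
    val trimmedRDD = rdd.map { r =>
      Row.fromSeq(r.toSeq.map {
        case s: String => s.trim
        case other     => other
      })
    }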

Rather than reading it as a textFile, read the CSV file with the spark-csv package.

In your case:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use the first line of each file as the header
    .option("inferSchema", "true") // Automatically infer column types
    .option("quote", "\"") // Set the quote character
    .option("ignoreLeadingWhiteSpace", "true") // Ignore leading whitespace in fields
    .load("cars.csv")

This loads the data as a DataFrame; now you can easily change it to an RDD.
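
For completeness, the conversion itself is a single call (a sketch, using the df loaded above):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // .rdd gives back the underlying RDD[Row] of the DataFrame.
    val rowRDD: RDD[Row] = df.rdd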

Hope this helps!

Apart from the other answers, which are correct, the right way to do what you're trying to do is to use Row.fromSeq inside the map function.

val rdd = sc.parallelize(Array((1 to 258).toArray, (1 to 258).toArray) )
            .map(Row.fromSeq(_))

This will turn your RDD into an RDD of Row:

 Array[org.apache.spark.sql.Row] = Array([1,2,3,4,5,6,7,8,9,10...
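
Applied to the CSV from the question, the same idea looks roughly like this (a sketch, assuming comma-separated fields that should all be trimmed):

    import org.apache.spark.sql.Row

    val data_csv = sc.textFile("~/test.csv")

    // Split each line, trim every field, and build a Row from the whole
    // sequence, so none of the 258 columns has to be written out by hand.
    val rowRDD = data_csv
      .map(_.split(",").map(_.trim))
      .map(fields => Row.fromSeq(fields))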
