
Spark: convert a CSV to RDD[Row]

I have a .csv file which contains 258 columns, in the following structure:

["label", "index_1", "index_2", ... , "index_257"]

Now I want to transform this .csv file into an RDD[Row]:

val data_csv = sc.textFile("~/test.csv")

val rowRDD = data_csv.map(_.split(",")).map(p => Row( p(0), p(1).trim, p(2).trim)) 

If I do the transform this way, I have to write out all 258 columns explicitly. So I tried:

val rowRDD = data_csv.map(_.split(",")).map(p => Row( _ => p(_).trim)) 

and

val rowRDD = data_csv.map(_.split(",")).map(p => Row( x => p(x).trim))

But neither of these works; both report the error:

error: missing parameter type for expanded function ((x$2) => p(x$2).trim)

Can anyone tell me how to do this transform? Thanks a lot.

You should use sqlContext instead of sparkContext, as in:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("~/test.csv")

This will create a DataFrame. Calling .rdd on the df should give you RDD[Row]:

val rdd = df.rdd
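
If you still need each string field trimmed, as in the original map, you can do that on the resulting RDD. This is only a sketch, assuming the columns were read as plain strings (no inferSchema):

    import org.apache.spark.sql.Row

    // Trim every String field of each Row and rebuild the Row from the sequence.
    val trimmedRDD = rdd.map { r =>
      Row.fromSeq(r.toSeq.map {
        case s: String => s.trim
        case other     => other
      })
    }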

Rather than reading it as a textFile, read the CSV file with the spark-csv package.

In your case:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use the first line of each file as the header
    .option("inferSchema", "true") // Automatically infer column types
    .option("quote", "\"") // Set the quote character
    .option("ignoreLeadingWhiteSpace", "true") // Ignore leading whitespace in fields
    .load("cars.csv")

This loads the data as a DataFrame; now you can easily change it to an RDD.
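
For completeness, the conversion itself is a single call (a sketch, using the df loaded above):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // .rdd gives back the underlying RDD[Row] of the DataFrame.
    val rowRDD: RDD[Row] = df.rdd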

Hope this helps!

Apart from the other answers, which are correct, the right way to do what you're trying to do is to use Row.fromSeq inside the map function.

val rdd = sc.parallelize(Array((1 to 258).toArray, (1 to 258).toArray) )
            .map(Row.fromSeq(_))

This will turn your RDD into an RDD of Row:

 Array[org.apache.spark.sql.Row] = Array([1,2,3,4,5,6,7,8,9,10...
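
Applied to the CSV from the question, the same idea looks roughly like this (a sketch, assuming comma-separated fields that should all be trimmed):

    import org.apache.spark.sql.Row

    val data_csv = sc.textFile("~/test.csv")

    // Split each line, trim every field, and build a Row from the whole
    // sequence, so none of the 258 columns has to be written out by hand.
    val rowRDD = data_csv
      .map(_.split(",").map(_.trim))
      .map(fields => Row.fromSeq(fields))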
