
Convert csv file to dataframe in Spark 1.5.2 without databricks

I am trying to convert a csv file to a dataframe in Spark 1.5.2 with Scala, without using the databricks library, as this is a community project and that library is not available. My approach was the following:

var inputPath  = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim))
var header = rows.first()
var data = rows.filter(_(0) != header(0))
var df = sc.makeRDD(1 to data.count().toInt)
  .map(i => (data.take(i).drop(i-1)(0)(0),
             data.take(i).drop(i-1)(0)(1),
             data.take(i).drop(i-1)(0)(2),
             data.take(i).drop(i-1)(0)(3),
             data.take(i).drop(i-1)(0)(4)))
  .toDF(header(0), header(1), header(2), header(3), header(4))

This code, even though it is quite a mess, runs without returning any error messages. The problem comes when trying to display the data inside df in order to verify the correctness of this method and later run some queries on df. The error code I get after executing df.show() is SPARK-5063 (a short sketch of the likely cause follows the questions below). My questions are:

1) Why is it not possible to print the content of df?

2) Is there any other more straightforward method to convert a csv to a dataframe in Spark 1.5.2 without using the databricks library?
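Regarding question 1: SPARK-5063 is the error Spark raises when RDD transformations or actions are invoked from inside another transformation, which only the driver is allowed to do. In the snippet above, data.take(i) is called inside the map over sc.makeRDD(1 to data.count().toInt), so the RDD data is referenced on the executors, which is exactly what SPARK-5063 forbids. A minimal sketch of a workaround, assuming the data and header values from the question, is to build the tuples directly from data without any nested RDD calls:

// Sketch of a workaround (assumes sc, data and header from the question above).
// Instead of calling data.take(i) inside a map over another RDD (which triggers
// SPARK-5063), map each parsed row of data directly to a tuple and call toDF.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = data
  .map(r => (r(0), r(1), r(2), r(3), r(4)))   // one tuple per CSV row (5 String columns)
  .toDF(header(0), header(1), header(2), header(3), header(4))

df.show()   // no nested RDD access, so show() works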

For Spark 1.5.x, the code snippet below can be used to convert the input into a DataFrame:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class DataClass(id: Int, name: String, surname: String, bdate: String, address: String)

// Create an RDD of DataClass objects and register it as a table.
val peopleData = sc.textFile("input.csv").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim, p(2).trim, p(3).trim, p(4).trim)).toDF()
peopleData.registerTempTable("dataTable")

val peopleDataFrame = sqlContext.sql("SELECT * from dataTable")

peopleDataFrame.show()
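One caveat with the snippet above, assuming input.csv has a header line as in the question: the integer parse on the first column (p(0).trim.toInt) would fail on the header row, so it should be filtered out first. A minimal sketch, reusing the DataClass case class and the sqlContext implicits defined above:

// Sketch: skip the header line before mapping to the case class
// (assumes the DataClass case class and import sqlContext.implicits._ above).
val raw = sc.textFile("input.csv")
val headerLine = raw.first()

val peopleData = raw
  .filter(_ != headerLine)                       // drop the header line
  .map(_.split(",").map(_.trim))
  .map(p => DataClass(p(0).toInt, p(1), p(2), p(3), p(4)))
  .toDF()

peopleData.show()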

Spark 1.5

You can create it like this (note that this example uses the SparkSession API, which was introduced in Spark 2.0):

SparkSession spark = SparkSession
    .builder()
    .appName("RDDtoDF_Updated")
    .master("local[2]")
    .config("spark.some.config.option", "some-value")
    .getOrCreate();

StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("eid", DataTypes.IntegerType, false),
    DataTypes.createStructField("eName", DataTypes.StringType, false),
    DataTypes.createStructField("eAge", DataTypes.IntegerType, true),
    DataTypes.createStructField("eDept", DataTypes.IntegerType, true),
    DataTypes.createStructField("eSal", DataTypes.IntegerType, true),
    DataTypes.createStructField("eGen", DataTypes.StringType, true) });

String filepath = "F:/Hadoop/Data/EMPData.txt";
JavaRDD<Row> empRDD = spark.read()
    .textFile(filepath)
    .javaRDD()
    .map(line -> line.split("\\,"))
    .map(r -> RowFactory.create(Integer.parseInt(r[0]), r[1].trim(), Integer.parseInt(r[2]),
        Integer.parseInt(r[3]), Integer.parseInt(r[4]), r[5].trim()));

Dataset<Row> empDF = spark.createDataFrame(empRDD, schema);
empDF.groupBy("eDept").max("eSal").show();

Using Spark with Scala.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val inputPath = "input.csv"
val text = sc.textFile(inputPath)

// Split each line into trimmed fields and take the first line as the header
val fields = text.map(line => line.split(",").map(_.trim))
val header = fields.first()

// Drop the header line and convert the remaining arrays to Rows
val rows = fields.filter(_(0) != header(0)).map(a => Row.fromSeq(a))

// Build a schema with all columns as nullable strings
val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))

val df = hiveCtx.createDataFrame(rows, schema)

This should work.

But for creating a DataFrame, I would recommend using Spark-CSV.
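For reference, if the spark-csv package were available on the cluster, the usual Spark 1.5.x usage would look roughly like this (a sketch only, since the question states the package cannot be used):

// Sketch: reading a CSV with spark-csv (requires the com.databricks:spark-csv
// artifact on the classpath, which the question says is not available here).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line contains the column names
  .option("inferSchema", "true")   // infer column types instead of all strings
  .load("input.csv")

df.show()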

