How do I create a DataSet from a parquet?

I have the following code to read data from a parquet file into a DataFrame:

DataFrame addressDF = sqlContext.read().parquet(addressParquetPath);

How do I read data from a parquet file into a Dataset?

Dataset dataset = sqlContext.createDataset(sqlContext.read().parquet(propertyParquetPath).toJavaRDD(), Encoder.);

What should the Encoder parameter contain? Also, do I have to create a property class and pass that, or how does it work?

The Encoder for a type T is the class that tells Spark how instances of T can be encoded to and decoded from the internal Spark representation. It contains the schema of the class and the Scala ClassTag, which is used to create your class via reflection.

In your code, you don't specialize Dataset over any type T, so I cannot create an Encoder for you, but I can give you as an example the one from the Databricks Spark documentation, which I suggest reading because it is great. First of all, let's create the class University that we want to load into a Dataset:

public class University implements Serializable {
    private String name;
    private long numStudents;
    private long yearFounded;

    public void setName(String name) {...}
    public String getName() {...}
    public void setNumStudents(long numStudents) {...}
    public long getNumStudents() {...}
    public void setYearFounded(long yearFounded) {...}
    public long getYearFounded() {...}
}
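The accessor bodies are elided above. As a minimal sketch, assuming each accessor simply reads or writes its backing field, the complete bean could look like the following. The main method and its sample values are hypothetical and only there to show the class in use. The implicit public no-argument constructor is left in place, since bean-style instantiation relies on it.

```java
import java.io.Serializable;

// Sketch of the University bean with the elided accessor bodies filled in.
// Assumption: every accessor is a plain field read or write, nothing more.
public class University implements Serializable {
    private String name;
    private long numStudents;
    private long yearFounded;

    // No explicit constructor: the implicit public no-argument one is kept.

    public void setName(String name) { this.name = name; }
    public String getName() { return name; }

    public void setNumStudents(long numStudents) { this.numStudents = numStudents; }
    public long getNumStudents() { return numStudents; }

    public void setYearFounded(long yearFounded) { this.yearFounded = yearFounded; }
    public long getYearFounded() { return yearFounded; }

    // Hypothetical smoke test with made-up values.
    public static void main(String[] args) {
        University u = new University();
        u.setName("Example University");
        u.setNumStudents(12000L);
        u.setYearFounded(1900L);
        System.out.println(u.getName() + ", " + u.getNumStudents() + ", " + u.getYearFounded());
    }
}
```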

Now, University is a Java Bean, and Spark's Encoders class provides a way to create encoders for Java Beans with the method bean:

Encoder<University> universityEncoder = Encoders.bean(University.class);

which can then be passed to as() to turn the result of the parquet read directly into a Dataset of University, without going through createDataset and an intermediate RDD (which is redundant):

Dataset<University> schools = sqlContext.read().parquet("/schools.parquet").as(universityEncoder);

and now schools is a Dataset<University> read from a parquet file.
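To make the earlier point about schema and reflection more concrete, here is a simplified illustration written with plain java.lang.reflect. It is not Spark's code and does not use any Spark API. The class BeanSchemaSketch and the method describe are hypothetical names, and the rule "a property is a public getter with a matching setter" is an assumption made for this sketch. It only shows the kind of information (property names and types) that a bean-based encoder can derive from a class like University.

```java
import java.io.Serializable;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: derives a name -> type map from a bean via reflection.
public class BeanSchemaSketch {

    // Same shape as the University bean above, repeated so the sketch compiles on its own.
    public static class University implements Serializable {
        private String name;
        private long numStudents;
        private long yearFounded;

        public void setName(String name) { this.name = name; }
        public String getName() { return name; }
        public void setNumStudents(long numStudents) { this.numStudents = numStudents; }
        public long getNumStudents() { return numStudents; }
        public void setYearFounded(long yearFounded) { this.yearFounded = yearFounded; }
        public long getYearFounded() { return yearFounded; }
    }

    // Collects every public, non-static getX() that has a matching setX(type).
    // A TreeMap keeps the result sorted, because getMethods() has no defined order.
    public static Map<String, Class<?>> describe(Class<?> beanClass) {
        Map<String, Class<?>> schema = new TreeMap<>();
        for (Method getter : beanClass.getMethods()) {
            String n = getter.getName();
            if (!n.startsWith("get") || n.length() <= 3
                    || getter.getParameterCount() != 0
                    || Modifier.isStatic(getter.getModifiers())) {
                continue;
            }
            try {
                beanClass.getMethod("set" + n.substring(3), getter.getReturnType());
            } catch (NoSuchMethodException e) {
                continue; // read-only property such as getClass(); skipped in this sketch
            }
            String property = Character.toLowerCase(n.charAt(3)) + n.substring(4);
            schema.put(property, getter.getReturnType());
        }
        return schema;
    }

    public static void main(String[] args) throws Exception {
        // Print the discovered properties and their Java types.
        System.out.println(describe(University.class));

        // Create an instance through the no-argument constructor, then fill it via a setter.
        University u = University.class.getDeclaredConstructor().newInstance();
        u.setName("Example University");
        System.out.println(u.getName());
    }
}
```

This is why the class passed to Encoders.bean should follow the Java Bean conventions: a public no-argument constructor plus a getter and setter for each field that should appear as a column.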
