
Convert Spark DataFrame to Pojo Object

Please see the code below:

    //Create Spark Context
    SparkConf sparkConf = new SparkConf().setAppName("TestWithObjects").setMaster("local");
    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
    //Creating RDD
    JavaRDD<Person> personsRDD = javaSparkContext.parallelize(persons);
    //Creating SQL context
    SQLContext sQLContext = new SQLContext(javaSparkContext);
    DataFrame personDataFrame = sQLContext.createDataFrame(personsRDD, Person.class);
    personDataFrame.show();
    personDataFrame.printSchema();
    personDataFrame.select("name").show();
    personDataFrame.registerTempTable("peoples");
    DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
    result.show();

After this I need to convert the DataFrame 'result' to a Person object or a List&lt;Person&gt;. Thanks in advance.

DataFrame is simply a type alias of Dataset[Row]. Operations on it are referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.
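To make the distinction concrete, here is a minimal sketch (df stands for any Dataset&lt;Row&gt;, Person is the bean from the question, and the FilterFunction cast is how Java disambiguates the filter overload):

    // Untyped transformation: columns addressed by name, checked only at runtime
    Dataset<Row> names = df.select("name");

    // Typed transformation: rows are Person objects, checked at compile time
    Dataset<Person> people = df.as(Encoders.bean(Person.class));
    Dataset<Person> matches = people.filter(
        (FilterFunction<Person>) p -> "test".equals(p.getName()));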

The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:

DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");

At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.

// Create an Encoder for the Person Java bean
Encoder<Person> personEncoder = Encoders.bean(Person.class); 
Dataset<Person> personDF = result.as(personEncoder);
personDF.show();

Now, Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the Person class.
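Since the question asks for a Person object or a list, the typed Dataset can then be materialized on the driver. A minimal sketch (collecting is only sensible for small result sets):

    // Bring the typed rows back to the driver as plain Java objects
    List<Person> personList = personDF.collectAsList(); // java.util.List<Person>
    Person firstPerson = personDF.first();              // a single Person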

Please refer to the link below, provided by Databricks, for further details:

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

A DataFrame is stored as Rows, so you can use the methods on Row to cast from untyped to typed values. Take a look at the get methods (getAs, getString, getInt, and so on).
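As a minimal sketch of that approach (assuming Person follows JavaBean conventions with a no-arg constructor and setters, which the question does not show; stream imports omitted):

    // Manually map each untyped Row to a Person via Row's get methods
    List<Person> people = result.collectAsList().stream()
        .map(row -> {
            Person p = new Person();              // assumed no-arg constructor
            p.setName(row.<String>getAs("name")); // assumed setter
            return p;
        })
        .collect(Collectors.toList());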

If someone is looking to convert a JSON string column in a Dataset&lt;Row&gt; to a Dataset&lt;PojoClass&gt;:

Sample POJO, Testing:

import java.io.Serializable;

import lombok.Data;

@Data
public class Testing implements Serializable {
    private String name;
    private String dept;
}

In the above code, @Data is an annotation from Lombok that generates getters and setters for the Testing class.
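This matters because Encoders.bean relies on JavaBean conventions; without Lombok the same class would need its accessors written out by hand. A sketch of what @Data generates here (omitting equals/hashCode/toString):

    public class Testing implements Serializable {
        private String name;
        private String dept;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getDept() { return dept; }
        public void setDept(String dept) { this.dept = dept; }
    }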

Actual conversion logic in Spark:

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.ArrayList;
import java.util.Arrays;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.junit.jupiter.api.Test;

@Test
void shouldConvertJsonStringToPojo() {
    // Local master so the test is self-contained
    var sparkSession = SparkSession.builder().master("local[1]").getOrCreate();

    // Single string column holding the raw JSON payload
    var structType = new StructType(new StructField[] {
        new StructField("employee", DataTypes.StringType, false, Metadata.empty())
    });

    var ds = sparkSession.createDataFrame(new ArrayList<>(
        Arrays.asList(RowFactory.create("{ \"name\": \"test\", \"dept\": \"IT\"}"))), structType);

    var objectMapper = new ObjectMapper();
    var bean = Encoders.bean(Testing.class);

    // Deserialize the JSON string in each Row into a Testing bean
    var testingDataset = ds.map((MapFunction<Row, Testing>) row -> {
        var json = row.<String>getAs("employee");
        return objectMapper.readValue(json, Testing.class);
    }, bean);

    assertEquals("test", testingDataset.head().getName());
}
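As a design note, in recent Spark versions the hand-rolled Jackson call can be replaced by the built-in from_json function, letting Spark parse the column before binding it to the bean encoder. A minimal sketch under that assumption (col and from_json statically imported from org.apache.spark.sql.functions):

    // Parse the JSON column with Spark's built-in parser, then bind to the bean
    var encoder = Encoders.bean(Testing.class);
    var testingDataset = ds
        .select(from_json(col("employee"), encoder.schema()).as("t"))
        .select("t.*")
        .as(encoder);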
