Convert Spark DataFrame to Pojo Object
Please see the code below:
//Create Spark Context
SparkConf sparkConf = new SparkConf().setAppName("TestWithObjects").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
//Creating RDD
JavaRDD<Person> personsRDD = javaSparkContext.parallelize(persons);
//Creating SQL context
SQLContext sQLContext = new SQLContext(javaSparkContext);
DataFrame personDataFrame = sQLContext.createDataFrame(personsRDD, Person.class);
personDataFrame.show();
personDataFrame.printSchema();
personDataFrame.select("name").show();
personDataFrame.registerTempTable("peoples");
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
result.show();
After this I need to convert the DataFrame 'result' to a Person object or a List&lt;Person&gt;. Thanks in advance.
DataFrame is simply a type alias of Dataset[Row]. Operations on it are referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.
The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.
// Create an Encoder for the Java bean
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDF = result.as(personEncoder);
personDF.show();
Now, Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the Person class.
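To get back to a plain Person object or a List, as the question asks, the typed Dataset can then be collected to the driver. A minimal sketch using the personDF from above (note that collecting pulls all matching rows into driver memory):

// All matching rows as a java.util.List<Person> on the driver
List<Person> people = personDF.collectAsList();
// Or just the first match (throws if the result is empty)
Person first = personDF.first();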
Please refer to the link below, provided by Databricks, for further details:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
A DataFrame is stored as Rows, so you can use the methods there to cast from untyped to typed. Take a look at the get methods.
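For example, a hand-rolled mapping via Row's getters might look like the sketch below. It assumes Spark 2.x (where DataFrame is Dataset<Row>) and that Person is a bean with a no-arg constructor and setters; none of that is stated in the original answer.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Map each generic Row to a Person by reading columns by name
Dataset<Person> typed = result.map((MapFunction<Row, Person>) row -> {
    Person p = new Person();
    p.setName(row.<String>getAs("name")); // getAs looks the column up by name
    return p;
}, Encoders.bean(Person.class));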
If someone is looking to convert a JSON string column in a Dataset<Row> to a Dataset<PojoClass>:
Sample POJO: Testing
@Data
public class Testing implements Serializable {
private String name;
private String dept;
}
In the above code, @Data is an annotation from Lombok that generates getters and setters for the Testing class.
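If Lombok is not available, a plain equivalent spells the accessors out by hand (a sketch; @Data also generates equals, hashCode and toString, omitted here):

public class Testing implements Serializable {
    private String name;
    private String dept;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getDept() { return dept; }
    public void setDept(String dept) { this.dept = dept; }
}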
Actual conversion logic in Spark:
@Test
void shouldConvertJsonStringToPojo() {
    var sparkSession = SparkSession.builder().getOrCreate();
    // Single string column holding the JSON payload
    var structType = new StructType(new StructField[] {
            new StructField("employee", DataTypes.StringType, false, Metadata.empty()),
    });
    var ds = sparkSession.createDataFrame(new ArrayList<>(
            Arrays.asList(RowFactory.create(new Object[]{"{ \"name\": \"test\", \"dept\": \"IT\"}"}))), structType);
    var objectMapper = new ObjectMapper();
    var bean = Encoders.bean(Testing.class);
    // Deserialize the JSON string of each Row into a Testing bean
    var testingDataset = ds.map((MapFunction<Row, Testing>) row -> {
        var json = row.<String>getAs("employee");
        return objectMapper.readValue(json, Testing.class);
    }, bean);
    assertEquals("test", testingDataset.head().getName());
}
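As an alternative to hand-rolling Jackson inside map, the same conversion can be sketched with Spark's built-in from_json function plus a bean encoder. Column and variable names follow the test above; treat this as an assumption-laden sketch rather than the answer's code:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

// Schema describing the JSON payload inside the "employee" column
var jsonSchema = new StructType()
        .add("name", DataTypes.StringType)
        .add("dept", DataTypes.StringType);

// Parse the string column into a struct, flatten it, then bind the bean encoder
var testingDataset = ds
        .select(from_json(col("employee"), jsonSchema).as("t"))
        .select("t.*")
        .as(Encoders.bean(Testing.class));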