
Convert a user-defined object into a DataFrame and write to an RDBMS - how to maintain the mapping with the database?

I have the following table structure in MySQL:

create table user(
    id INT NOT NULL,
    name VARCHAR(20) NOT NULL,
    age INT NOT NULL,
    address VARCHAR(100) NOT NULL);

Now, I want to write a Spark Streaming job that reads data from Kafka, does some processing and filtering, and writes it to the 'user' table in the RDBMS.

For this, I have first created a POJO representation of the table -

@Data // Lombok generates the getters, setters, equals/hashCode, and toString
class User implements Serializable {
private int id;
private String name;
private int age;
private String address;
}

Below is the Spark job that converts the RDD into a DataFrame -

JavaDStream<User> userStream = ... // created this stream with some processing
userStream.foreachRDD(rdd -> {
    DataFrame df = sqlContext.createDataFrame(rdd, User.class);
    df.write().mode(SaveMode.Append).jdbc(MYSQL_CONNECTION_URL, "user", new java.util.Properties());
});

Now, once I execute this piece of code, the DataFrame columns come out in a haphazard order (alphabetical by field name) that is not in sync with the database schema. Hence, it tries to insert the 'address' value into the 'id' column and exits with a SQLException.
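
For reference, one minimal workaround under the same Spark 1.x DataFrame API shown above: since createDataFrame(rdd, User.class) derives the columns from the bean in alphabetical order, you can re-select them in the table's column order before writing. This is a sketch reusing the question's own names, not a verified fix:

userStream.foreachRDD(rdd -> {
    DataFrame df = sqlContext.createDataFrame(rdd, User.class);
    // Bean-derived columns come out as address, age, id, name;
    // re-select them to match the MySQL table's column order.
    DataFrame ordered = df.select("id", "name", "age", "address");
    ordered.write().mode(SaveMode.Append).jdbc(MYSQL_CONNECTION_URL, "user", new java.util.Properties());
});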

I cannot understand how I could make the DataFrame understand the schema of the database and load the data from the User object accordingly. Is there any way to do that? I think JavaRDD<User> can be mapped to JavaRDD<Row>, but then I cannot understand what to do further.

Also, I believe this createDataFrame() API has to process the POJO using reflection, so there is also a question of performance impact. Can you tell me if there is a way to maintain the mapping between the POJO and the relational database, and insert the data?

Doing it this way has worked for me.

@Data
class User implements Serializable {
    private int id;
    private String name;
    private int age;
    private String address;

    // Schema mirroring the MySQL table, column for column
    private static StructType structType = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("id", DataTypes.IntegerType, false),
            DataTypes.createStructField("name", DataTypes.StringType, false),
            DataTypes.createStructField("age", DataTypes.IntegerType, false),
            DataTypes.createStructField("address", DataTypes.StringType, false)
    });

    public static StructType getStructType() {
        return structType;
    }

    // Must return the values in the same order as the fields in structType
    public Object[] getAllValues() {
        return new Object[]{id, name, age, address};
    }
}

The Spark job -

JavaDStream<User> userStream = ... // created this stream with some processing
userStream.map(user -> RowFactory.create(user.getAllValues()))
        .foreachRDD(rdd -> {
            DataFrame df = sqlContext.createDataFrame(rdd, User.getStructType());
            df.write().mode(SaveMode.Append).jdbc(MYSQL_CONNECTION_URL, "user", new java.util.Properties());
        });
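
A side note on the jdbc() call: both versions pass an empty java.util.Properties, which only works if the credentials are embedded in MYSQL_CONNECTION_URL. A minimal sketch of passing them as connection properties instead (the credential values and driver class below are placeholders, not from the question):

// inside foreachRDD, replacing the empty Properties:
java.util.Properties connectionProperties = new java.util.Properties();
connectionProperties.put("user", "dbuser");                  // placeholder username
connectionProperties.put("password", "dbpassword");          // placeholder password
connectionProperties.put("driver", "com.mysql.jdbc.Driver"); // loads the MySQL driver explicitly if needed
df.write().mode(SaveMode.Append).jdbc(MYSQL_CONNECTION_URL, "user", connectionProperties);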

I think this is a better way than the previous one, because there the DataFrame used reflection to map the POJO into its own data structure. This way is cleaner: Row is Spark SQL's own format, the order in which values are inserted into the DataFrame is stated explicitly in getAllValues(), and the column mapping is stated in getStructType().

Please correct me if I am wrong.
