
Data transfer from CSV format to Redis hash format in Databricks

I have an Azure setup divided into three parts:

  1. Azure Data Lake Storage, where I have a CSV file.
  2. Azure Databricks, where I need to do some processing: specifically, converting that CSV file to Redis hash format.
  3. Azure Redis Cache, where I should put the converted data.

After mounting the storage in the Databricks filesystem, the data needs to be processed. How do I convert CSV data located in the Databricks filesystem to Redis hash format and put it into Redis correctly? Specifically, I'm not sure how to write a correct mapping in the code below. Or maybe there is some way of going through a SQL table that I haven't found.

Here is my example code, written in Scala:

import com.redislabs.provider.redis._

val redisServerDnsAddress = "HOST"
val redisPortNumber = 6379
val redisPassword = "Password"
val redisConfig = new RedisConfig(new RedisEndpoint(redisServerDnsAddress, redisPortNumber, redisPassword))

val data = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/mnt/staging/data/file.csv")

// What is the right way of mapping?
val ds = table("data").select("Prop1", "Prop2", "Prop3", "Prop4", "Prop5" ).distinct.na.drop().map{x =>
  (x.getString(0), x.getString(1), x.getString(2), x.getString(3), x.getString(4))
}

sc.toRedisHASH(ds, "data")

The error:

error: type mismatch;
 found   : org.apache.spark.sql.Dataset[(String, String)]
 required: org.apache.spark.rdd.RDD[(String, String)]
sc.toRedisHASH(ds, "data")

If I write the last line of the code this way:

sc.toRedisHASH(ds.rdd, "data")

The error:

org.apache.spark.sql.AnalysisException: Table or view not found: data;
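
For what it's worth, both errors come from the snippet itself: toRedisHASH expects an RDD[(String, String)] of (field, value) pairs that all go into a single hash, and table("data") only resolves if a temporary view named data has been registered first. Below is a minimal sketch of that RDD-based route; the assumption that Prop1 holds the hash field names and Prop2 the values is mine, purely for illustration:

    import spark.implicits._

    // Register the DataFrame so that spark.table("data") resolves;
    // without this you get exactly the AnalysisException above.
    data.createOrReplaceTempView("data")

    // toRedisHASH takes an RDD[(String, String)] of (field, value) pairs,
    // all of which are written into one hash named "data".
    val pairs = spark.table("data")
      .select("Prop1", "Prop2")
      .distinct.na.drop()
      .map(row => (row.getString(0), row.getString(1)))

    // Declare the RedisConfig from the question as `implicit val` (or set
    // spark.redis.* in the Spark conf) so toRedisHASH can find the server.
    sc.toRedisHASH(pairs.rdd, "data")

That said, the DataFrame API of spark-redis is the cleaner route.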

Prepare some sample data to mimic loading the data from a CSV file.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val rdd = spark.sparkContext.parallelize(Seq(Row("1", "2", "3", "4", "5", "6", "7")))
    val structType = StructType(
      Seq(
        StructField("Prop1", StringType),
        StructField("Prop2", StringType),
        StructField("Prop3", StringType),
        StructField("Prop4", StringType),
        StructField("Prop5", StringType),
        StructField("Prop6", StringType),
        StructField("Prop7", StringType)
      )
    )
    val data = spark.createDataFrame(rdd, structType)

Transformation:

val transformedData = data.select("Prop1", "Prop2", "Prop3", "Prop4", "Prop5").distinct.na.drop()
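
One thing to be aware of: unlike toRedisHASH, the DataFrame writer takes its connection settings from the Spark configuration (spark.redis.*), not from a RedisConfig object. Here is a sketch of wiring in the values from the question; on Databricks the session already exists, so setting these in the cluster's Spark config is usually the more reliable route:

    import org.apache.spark.sql.SparkSession

    // spark-redis reads these from the Spark config. getOrCreate() returns
    // the pre-existing session on Databricks, so prefer the cluster's Spark
    // config UI there; this builder form works for a fresh session.
    val spark = SparkSession.builder()
      .config("spark.redis.host", redisServerDnsAddress)
      .config("spark.redis.port", redisPortNumber.toString)
      .config("spark.redis.auth", redisPassword)
      .getOrCreate()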

Write the DataFrame to Redis, using Prop1 as the key column and data as the Redis table name (see the spark-redis docs):

    import org.apache.spark.sql.SaveMode

    transformedData
      .write
      .format("org.apache.spark.sql.redis")
      .option("key.column", "Prop1")
      .option("table", "data")
      .mode(SaveMode.Overwrite)
      .save()
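
Before dropping to redis-cli, the round trip can also be verified from the notebook by reading the table back through the same data source (assuming the same spark-redis version on the cluster):

    // Read the hashes back as a DataFrame; key.column maps the suffix of
    // the Redis key (data:<Prop1>) back onto the Prop1 column.
    val loaded = spark.read
      .format("org.apache.spark.sql.redis")
      .option("table", "data")
      .option("key.column", "Prop1")
      .load()

    loaded.show()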

Check the data in Redis. Note that Prop1 does not appear as a hash field: since it is the key column, its value is encoded in the Redis key itself (data:1):

127.0.0.1:6379> keys data:*
1) "data:1"

127.0.0.1:6379> hgetall data:1
1) "Prop5"
2) "5"
3) "Prop2"
4) "2"
5) "Prop4"
6) "4"
7) "Prop3"
8) "3"
