I have an Azure system divided into three parts. After mounting storage in the Databricks filesystem, I need to process some data. How do I convert CSV data located in the Databricks filesystem into the Redis hash format and correctly put it into Redis? Specifically, I'm not sure how to write a correct mapping in the code below. Alternatively, is there some way to transfer the data to an SQL table that I'm missing?
Here is my example code, written in Scala:
import com.redislabs.provider.redis._
val redisServerDnsAddress = "HOST"
val redisPortNumber = 6379
val redisPassword = "Password"
val redisConfig = new RedisConfig(new RedisEndpoint(redisServerDnsAddress, redisPortNumber, redisPassword))
val data = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/mnt/staging/data/file.csv")
// What is the right way of mapping?
val ds = table("data").select("Prop1", "Prop2", "Prop3", "Prop4", "Prop5").distinct.na.drop().map { x =>
  (x.getString(0), x.getString(1), x.getString(2), x.getString(3), x.getString(4))
}
sc.toRedisHASH(ds, "data")
The error:
error: type mismatch;
found : org.apache.spark.sql.Dataset[(String, String)]
required: org.apache.spark.rdd.RDD[(String, String)]
sc.toRedisHASH(ds, "data")
If I write the last line of code this way:
sc.toRedisHASH(ds.rdd, "data")
The error:
org.apache.spark.sql.AnalysisException: Table or view not found: data;
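For reference on those two errors: `toRedisHASH` expects an `RDD[(String, String)]` of field/value pairs, which is why the five-column tuple does not type-check, and `table("data")` looks up a registered table or view rather than the local `data` variable, which causes the `AnalysisException`. A minimal sketch of one way to make the RDD API compile, assuming (hypothetically) that `Prop1` should become the hash field and `Prop2` its value:

```scala
import org.apache.spark.rdd.RDD

// Sketch only: toRedisHASH wants RDD[(String, String)] (field -> value pairs),
// so each row must be reduced to a single pair. Referencing the `data`
// DataFrame directly avoids the "Table or view not found" error.
val kvRdd: RDD[(String, String)] =
  data.select("Prop1", "Prop2").distinct.na.drop()
    .rdd
    .map(row => (row.getString(0), row.getString(1)))

// Writes all pairs into a single Redis hash named "data"
sc.toRedisHASH(kvRdd, "data")
```

Note this stores everything in one hash; if each row should become its own hash, the DataFrame writer shown below is the better fit.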
Prepare some sample data to mimic loading from a CSV file (with the imports the snippet needs):
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types._

val rdd = spark.sparkContext.parallelize(Seq(Row("1", "2", "3", "4", "5", "6", "7")))
val structType = StructType(
  Seq(
    StructField("Prop1", StringType),
    StructField("Prop2", StringType),
    StructField("Prop3", StringType),
    StructField("Prop4", StringType),
    StructField("Prop5", StringType),
    StructField("Prop6", StringType),
    StructField("Prop7", StringType)
  )
)
val data = spark.createDataFrame(rdd, structType)
Transformation:
val transformedData = data.select("Prop1", "Prop2", "Prop3", "Prop4", "Prop5").distinct.na.drop()
Write the DataFrame to Redis, using Prop1 as the key column and data as the Redis table name (see the spark-redis docs):
transformedData
.write
.format("org.apache.spark.sql.redis")
.option("key.column", "Prop1")
.option("table", "data")
.mode(SaveMode.Overwrite)
.save()
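To verify from the Spark side, the same source format can read the table back into a DataFrame. A sketch, assuming the spark-redis DataFrame reader with the same `table` and `key.column` options used for the write:

```scala
// Read the Redis hashes back as a DataFrame; key.column restores
// the Prop1 values that were used as hash keys on write.
val fromRedis = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "data")
  .option("key.column", "Prop1")
  .load()

fromRedis.show()
```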
Check data in Redis:
127.0.0.1:6379> keys data:*
1) "data:1"
127.0.0.1:6379> hgetall data:1
1) "Prop5"
2) "5"
3) "Prop2"
4) "2"
5) "Prop4"
6) "4"
7) "Prop3"
8) "3"