
Serializing SDO_GEOMETRY type to text really slow

I have been trying for a couple of days now to extract SDO_GEOMETRY records from an Oracle table into a CSV file via Microsoft Azure Data Factory (gen2). My select statement looks like this:

select t.MY_GEOM.get_WKT() from my_table t

where the MY_GEOM column is of type SDO_GEOMETRY. It works, but it's really, really slow: about 2 hours to pull 74,000 records via this method. Without that conversion (so, a plain select without .get_WKT()) it takes about 32 seconds, but of course the result is rubbish and unusable.

Is there some way to speed up the process? My guess is that the problem is on the server side, but I'm not a DBA and don't have direct access to it. I can connect to it via SQL Developer or from Data Factory.

The data contained there is just some LINESTRING(x1 y1, x2 y2, ...)

I also tried running SDO_UTIL.TO_WKTGEOMETRY to convert it, but it's equally slow.
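For reference, that attempt was just the same query with the conversion moved into the utility function; a minimal sketch of it, assuming the same table and column as above:

select SDO_UTIL.TO_WKTGEOMETRY(t.MY_GEOM) from my_table t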

If you have any suggestions, please let me know.

Kind regards, Tudor

As far as I know, ADF does not impose any additional burden on data sources or sinks, so it looks like the performance bottleneck is on the database side, with the get_WKT() method.

Of course, you could refer to the tuning guides in this link to improve your transfer performance, especially the section on parallel copy. For each copy activity run, Azure Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store; that is based on the DIU setting.

I found a nice solution while searching for different approaches. As stated in some of the comments above, the solution that works for me consists of two steps:

  1. Split the SDO_GEOMETRY LINESTRING entry into its coordinates via the following select statement:
SELECT t.id, nt.COLUMN_VALUE AS coordinates, rownum FROM my_table t, TABLE(t.MY_GEOM.SDO_ORDINATES) nt 

I just use it in a plain Copy Activity in Azure Data Factory to save the raw files as CSVs into a Data Lake. The files are quite large, about 4 times bigger than the final version created by the next step, since each ordinate value becomes its own row.

  2. Aggregate the coordinates back into a string via some Databricks Scala Spark code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, max, udf}
import spark.implicits._ // for the $"column" syntax

// Joins the collected coordinate values into one comma-separated string
val mergeList = udf { strings: Seq[String] => strings.mkString(", ") } 
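The DataFrame df used below is the raw CSV output of step 1 read back from the Data Lake. A minimal sketch of that read, assuming the files sit under the $dataLakeFolderPath/$tableName folder used further down (the exact path is an assumption):

// Read the step-1 CSVs back into a DataFrame (the path is an assumption)
val df = spark.read
  .option("header", "true")
  .csv(s"$dataLakeFolderPath/$tableName")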

// Build a running list of coordinates per id, ordered by rownum, then keep
// the complete list for each id (max of the running lists) and join it into one string
val result = df
  .withColumn("collected",
    collect_list($"coordinates").over(Window.partitionBy("id").orderBy("rownum")))
  .groupBy("id")
  .agg(max($"collected").as("collected"))
  .withColumn("final_coordinates", mergeList($"collected"))
  .select("id", "final_coordinates")

val outputFilePrefix = s"$dataLakeFolderPath/$tableName"
val tmpOutputFolder = s"$outputFilePrefix.tmp"

// Write the result as a single CSV file into a temporary folder
result
  .coalesce(1)
  .write
  .option("header", "true")
  .csv(tmpOutputFolder)

// Locate the single part file Spark produced, copy it to the final name,
// then drop the temporary folder
val partition_path = dbutils.fs.ls(tmpOutputFolder)
  .filter(_.name.startsWith("part-"))
  .head
  .path
dbutils.fs.cp(partition_path, s"$outputFilePrefix.csv")
dbutils.fs.rm(tmpOutputFolder, recurse = true)

The final_coordinates column contains my coordinates in the proper order (I had some issues with this). And I can simply save the file back into my storage account. In the end, I only keep the proper CSV file that I am interested in.

As I said, it's quite fast. It takes about 2.5 minutes for my first step and a couple of seconds for the second one, compared to 2 hours, so I'm quite happy with this solution.
