Azure Data Factory: multi-level complex CSV structure

We have to deliver a rather complex CSV structure and would like to use Data Factory for this. The structure has multiple levels: a global header and trailer, plus a subheader (per topic) and its detail lines. The first column defines the type of each line. I've simplified the real format to highlight the questions I have.

HEADER - common data like export date and number sequence
SUBHEADER - topic name 1
DETAIL - detail line of above topic
DETAIL - detail line of above topic
DETAIL - detail line of above topic
SUBHEADER - topic name 2
DETAIL - detail line of above topic
DETAIL - detail line of above topic
DETAIL - detail line of above topic
TRAILER - A closing line with total linecount

The source data would be the detail lines + the topic name.

There are 2 problems I'm unable to solve:

  1. How do I convert the source data into the complex SUBHEADER + DETAIL format? To be honest, I have no clue how to approach this.
  2. Is there a way to add the global header and the trailer with the total line count via Data Factory? An alternative would be doing this with an Azure Function.

All suggestions are welcome...

Regards, Sven Peeters

You have a couple of choices with Azure Data Factory:

  • take an ELT approach where you use some type of compute (e.g. a SQL database, Databricks, Azure Batch, Azure Function or Azure Synapse serverless SQL pools if you're working in Synapse) to do the hard work of structuring the file and outputting it. ADF is really just doing the orchestration (telling other processes what to do in what order) and handling the output. The compute is handling the fiddly bit.
  • take an ETL approach and use Mapping Data Flows. This is a low-code approach which uses on-demand Spark clusters in the background. You do not have to manage them.

I would be tempted to use SQL to do this, particularly if you already have a SQL database in your infrastructure. A simplified example, where a sort key fixes the position of each block of lines:

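-- sortOrder fixes the position of each block in the output file
;WITH cte AS (
    -- global header line
    SELECT 10 AS sortOrder, 'someHeader' AS main
    UNION ALL
    -- detail lines; uncomment the FROM clause to read them from the source table
    SELECT 20, 'col1, col2, col3'
    --FROM someTable
    UNION ALL
    -- global trailer line
    SELECT 30, 'someFooter'
)
SELECT main
FROM cte
ORDER BY sortOrder;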
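
If you go for the Azure Function option from the question instead, the same idea (emit the blocks in a fixed order, then count the lines) could look roughly like the Python sketch below. It is only an illustration under assumptions: the function name build_export, the ";" separator, the (topic, detail) row shape and the rule that the trailer count includes the header and the trailer itself are all made up here, since the real format was simplified in the question.

```python
from itertools import groupby


def build_export(rows, export_date, sequence, sep=";"):
    """Build the export lines from (topic, detail) tuples.

    Hypothetical helper: the field layout and separator are assumptions.
    """
    # Global header with the common data (export date and number sequence).
    lines = [sep.join(["HEADER", export_date, str(sequence)])]

    # Sort by topic so each topic forms one consecutive run for groupby.
    # sorted() is stable, so detail lines keep their source order per topic.
    ordered = sorted(rows, key=lambda row: row[0])

    for topic, group in groupby(ordered, key=lambda row: row[0]):
        lines.append(sep.join(["SUBHEADER", topic]))
        for _, detail in group:
            lines.append(sep.join(["DETAIL", detail]))

    # Assumption: the total line count includes the header and the trailer.
    lines.append(sep.join(["TRAILER", str(len(lines) + 1)]))
    return lines


rows = [("topic 1", "a"), ("topic 2", "c"), ("topic 1", "b")]
export = build_export(rows, "2024-01-31", 1)
print("\n".join(export))
```

ADF would then only call the function and copy the resulting file to its destination. If the trailer should count detail lines only, change the expression on the TRAILER line accordingly.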

If you've got time, why not try both approaches as a proof of concept and see which works best for you, your data and your organisation? Look at factors such as time to develop, maintainability, flexibility and cost.
