
Azure Data Factory DYNAMICALLY partition a csv/txt file based on rowcount

I am using an Azure Data Flow to transform delimited files (CSV/TXT) to JSON. But I want to split the files dynamically based on a maximum row count of 5,000, because I will not know the row count each time. So if I have a CSV file with 10,000 rows, the pipeline should output two equal JSON files, file1.json and file2.json. What is the best way to actually get the row count of my sources, and the correct number n of partitions based on that row count, within Azure Data Factory?

We can't split the CSV file at a specified row number. The closest workaround is to specify the partitioning of the Sink.

For example, I have a CSV file containing 700 rows of data, and I successfully copied it to two equal JSON files.

My source CSV data in Blob storage: [screenshot]

Sink settings: each partition outputs a new file, json1.json and json2.json: [screenshot]

Optimize:

  1. Partition operation: Set partition
  2. Partition type: Dynamic partition
  3. Number of partitions: 2 (splits the CSV data into 2 partitions)
  4. Stored ranges in columns: id (partition based on the id column)

Run the Data Flow and the CSV file will be split into two JSON files, each containing 350 rows of data.
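To see what this dynamic partitioning produces, here is a minimal Python sketch, not ADF code; the file names source.csv, json1.json/json2.json and the id column are taken from the example above, and the rows are assumed to be sorted by id:

```python
import csv
import json

# Read the 700-row source CSV (assumed to have an "id" column, as above).
with open("source.csv", newline="") as f:
    rows = list(csv.DictReader(f))

partitions = 2
size = len(rows) // partitions  # 700 rows -> 350 rows per partition

# Rows are ordered by id, so contiguous slices mimic range partitioning on id.
for i in range(partitions):
    chunk = rows[i * size:] if i == partitions - 1 else rows[i * size:(i + 1) * size]
    with open(f"json{i + 1}.json", "w") as out:
        json.dump(chunk, out, indent=2)
```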

For your situation, with a CSV file of 10,000 rows, the pipeline will output two equal JSON files (each containing 5,000 rows of data).
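Since the row count is unknown up front, the partition count itself has to be computed. A sketch of that arithmetic, assuming a 5,000-row cap; how you feed the result back into the Data Flow's "Number of partitions" setting (e.g. via a pipeline parameter) depends on your pipeline:

```python
import math

MAX_ROWS = 5000  # desired cap per output file

def partition_count(row_count: int) -> int:
    """Smallest number of equal partitions with at most MAX_ROWS rows each."""
    return max(1, math.ceil(row_count / MAX_ROWS))

print(partition_count(10_000))  # 2 -> two files of 5,000 rows
print(partition_count(12_345))  # 3 -> three files of ~4,115 rows
```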

One way to achieve this is to use the mod or % operator.

  1. To start with, add a surrogate key to the CSV source, or use any sequential key already in the data.
  2. Add an Aggregate step with a group-by clause of key % n, where n is the number of output files (for a 5,000-row cap, n = ceil(total row count / 5,000)).
  3. Set the aggregate function to collect().

Your output should now be an array of rows with the expected count in each group.
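As a rough Python sketch of that flow (the sequential key, the modulo bucketing, and the per-bucket collect are mirrored here; the file names and the 5,000-row cap are assumptions, not ADF syntax):

```python
import csv
import json
import math
from collections import defaultdict

MAX_ROWS = 5000

with open("source.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Step 1: the enumerate index plays the role of the surrogate key.
# Step 2: group by key % n, which deals rows round-robin into n buckets.
n = max(1, math.ceil(len(rows) / MAX_ROWS))
buckets = defaultdict(list)
for key, row in enumerate(rows):
    buckets[key % n].append(row)

# Step 3: each bucket is the collect()-ed array of rows for one output file.
for part, chunk in sorted(buckets.items()):
    with open(f"file{part + 1}.json", "w") as out:
        json.dump(chunk, out, indent=2)
```

Because the keys are dealt round-robin, each bucket receives at most ceil(total / n) rows, so no output file exceeds the 5,000-row cap.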
