I am using azure dataflow to transform delimited files (csv/txt) to json. But I want to separate the files dynamically based on a max row count of 5,000 because I will not know the row count every time. So if I have a csv file with 10,000 rows the pipeline will output two equal json files, file1.json and file2.json. What is the best way to actually get the row count of my sources and the correct n number of partitions based on that row count within Azure Data Factory?
Data Flow can't split a csv file by a specific row count directly. The closest workaround is to specify the partitioning of the Sink.
For example, I have a csv file containing 700 rows of data, and I successfully copied it into two equal json files.
My source csv data in Blob storage:
Sink settings: each partition outputs a new file, json1.json and json2.json.
Optimize settings:

- Set partitioning: Dynamic partition
- Number of partitions: 2 (split the csv data into 2 partitions)
- Partition column: id (split based on the id column)

Run the Data Flow and the csv file will be split into two json files, each containing 350 rows of data.
For your situation, a csv file with 10,000 rows will give a pipeline that outputs two equal json files (each containing 5,000 rows of data).
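Since the row count isn't known in advance, the partition count has to be computed at runtime before the Data Flow runs (for example from a Lookup or Get Metadata activity) and passed in as a Data Flow parameter. The arithmetic is just a ceiling division; a minimal Python sketch of that calculation, assuming the row count has already been obtained:

```python
import math

MAX_ROWS_PER_FILE = 5_000  # the 5,000-row limit from the question

def partition_count(row_count: int) -> int:
    """Number of output files needed so no file exceeds MAX_ROWS_PER_FILE."""
    return max(1, math.ceil(row_count / MAX_ROWS_PER_FILE))

print(partition_count(10_000))  # -> 2
print(partition_count(10_001))  # -> 3
```

In an ADF pipeline expression the same ceiling division can be built from the `add` and `div` functions on the activity output (the exact output property path depends on which activity produces the count).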
One way to achieve this is to use the mod (%) operator on a row number to assign each row to a partition. Your output should then be an array of rows with the expected count in each.
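The mod approach round-robins rows across a fixed number of partitions, which yields equal-sized outputs when the partition count is chosen from the row count. A hypothetical Python illustration of the idea (not ADF syntax):

```python
def split_round_robin(rows, n_partitions):
    """Distribute rows across n_partitions using the mod operator,
    so each partition ends up with an (almost) equal share."""
    parts = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        parts[i % n_partitions].append(row)
    return parts

parts = split_round_robin(list(range(10_000)), 2)
print([len(p) for p in parts])  # -> [5000, 5000]
```

In a mapping data flow, the analogous step is to generate a row number (e.g. with a surrogate key transformation) and derive a partition key column from it with a mod expression, then use that column for the Sink's dynamic partitioning.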