
Databricks CSV multiple read

Suppose I have the following CSV files with the following contents:

file_20190901.csv

col1       col2         col3
data       20190901     A

file_20190902.csv

col1       col2         col3
data       20190901     B
data       20190902     A

Then, some days later, a new file file_20190903.csv will contain:

col1    col2         col3
data       20190902     B
data       20190903     A

The task now is to merge these CSV files into a single data frame, including all the records from 20190901 to 20190903 in col2 and, for each date, keeping the row from the latest file. The desired result is:

col1    col2         col3
data    20190901     B 
data    20190902     B
data    20190903     A

How to do this in Databricks using Python?

From the sample files, the two files share the same col2 value but carry different col3 values, so simply unioning the files will leave duplicate rows for that date:

file_20190901.csv

col1 col2 col3

data 20190901 A

file_20190902.csv

col1 col2 col3

data 20190901 B

How to read multiple CSV files:

Copy all the CSV files to DBFS as shown:

(screenshot: the CSV files uploaded to the /sample directory in DBFS)

Then create a Python notebook and run the following:

# Read every CSV under /sample into one DataFrame; the glob pattern matches all files
ReadMultiple = spark.read.format("csv").option("header", "true").load("/sample/*.csv")
display(ReadMultiple)

(screenshot: the combined DataFrame displayed in the notebook)

Hope this helps.
