Suppose I have the following CSV files with the following contents:
file_20190901.csv
col1 col2 col3
data 20190901 A
file_20190902.csv
col1 col2 col3
data 20190901 B
data 20190902 A
Then, some days later, a new file named file_20190903.csv will arrive containing
col1 col2 col3
data 20190902 B
data 20190903 A
The task now is to merge these CSV files into one data frame, including all the records from 20190901 to 20190903 in col2 and, where a col2 value appears in more than one file, keeping the row from the latest file. So the desired result is
col1 col2 col3
data 20190901 B
data 20190902 B
data 20190903 A
How to do this in Databricks using Python?
In your sample files, col2 has the same value but col3 has different values, so you cannot simply merge the files without deciding which row wins:
file_20190901.csv
col1 col2 col3
data 20190901 A
file_20190902.csv
col1 col2 col3
data 20190901 B
How to read multiple CSV files:
First copy all the CSV files to DBFS (for example under /sample/).
Then create a Python notebook and run the following:
# Read every CSV file under /sample/ into a single DataFrame
ReadMultiple = spark.read.format("csv").option("header", "true").load("/sample/*.csv")
display(ReadMultiple)
Hope this helps.