
Databricks CSV multiple read

Suppose I have the following CSV files with the following contents:

file_20190901.csv

col1       col2         col3
data       20190901     A

file_20190902.csv

col1       col2         col3
data       20190901     B
data       20190902     A

Then, some days later, a new file file_20190903.csv will contain:

col1    col2         col3
data       20190902     B
data       20190903     A

The task now is to merge these CSV files into a single data frame, including all the records from 20190901 to 20190903 in col2 and, for each date, keeping the row from the latest file. The desired result is:

col1    col2         col3
data    20190901     B 
data    20190902     B
data    20190903     A

How to do this in Databricks using Python?

From the sample files, the two files share the same col2 value but carry different col3 values, so simply unioning the files will leave duplicate rows for that date:

file_20190901.csv

col1 col2 col3

data 20190901 A

file_20190902.csv

col1 col2 col3

data 20190901 B

How to read multiple CSV files:

Copy all the CSV files to DBFS as shown:

(screenshot: the CSV files uploaded to the /sample directory in DBFS)

Then create a Python notebook and run the following:

# Read every CSV under /sample into one DataFrame; the glob pattern matches all files
ReadMultiple = spark.read.format("csv").option("header", "true").load("/sample/*.csv")
display(ReadMultiple)

(screenshot: the combined DataFrame displayed in the notebook)

Hope this helps.
