I have 200 CSV part files, separated by year from 2012 through 2018. I want to further split these files by the date column they contain, using PySpark. I'm looking for an efficient way to do this, since each CSV holds millions of rows.

My current approach is to read all the CSV files for 2012 into a DataFrame, then loop over the 365 days of the year, filtering the DataFrame by date and writing each day's rows out to its own CSV.

Is there a more efficient way to achieve this in PySpark?
I have put sample data below:
> 1234|2012-01-01|abc|def|455
>
> 1278|2012-04-05|duuj|dea|457
>
> 9998|2012-05-09|dimd|ase|759
>
> 8892|2012-01-01|eedbnd|ss|378
>
> 178|2012-04-05|dswuj|ada|47
>
> 278|2012-04-05|d32j|d12a|421
I need this data written into 3 separate CSV files, containing the data for 2012-01-01, 2012-04-05 and 2012-05-09 respectively.
There are 3 distinct dates in the sample data: 2012-01-01, 2012-04-05 and 2012-05-09.
Create a `(key, value)` pair with the date as the key, then pass a hash function to `partitionBy` so that all rows sharing a date land in the same partition:

```python
def fn(dt):
    return hash(dt)

# Key each row by its date column (field 1), then partition by the hashed key
rdd = (sc.textFile('path/your_file.txt', 3)
         .map(lambda r: r.split('|'))
         .map(lambda r: (r[1], r)))

rdd.partitionBy(3, fn).saveAsTextFile('partitioned_parts')
```

You should now see 3 part files, each holding the rows for specific dates. Two caveats: `saveAsTextFile` writes each `(key, value)` tuple's string representation, so map the values back to a `'|'`-joined line first if you need the original format; and with only 3 partitions, two distinct dates can hash to the same part file.
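To see what the partitioner is doing, here is a minimal plain-Python sketch (no Spark) of the same routing: each row goes to partition `hash(date) % 3`, so all rows sharing a date end up together. The `rows` list is the sample data from the question; `fn` and the partition count of 3 mirror the `partitionBy(3, fn)` call above.

```python
rows = [
    "1234|2012-01-01|abc|def|455",
    "1278|2012-04-05|duuj|dea|457",
    "9998|2012-05-09|dimd|ase|759",
    "8892|2012-01-01|eedbnd|ss|378",
    "178|2012-04-05|dswuj|ada|47",
    "278|2012-04-05|d32j|d12a|421",
]

def fn(dt):
    return hash(dt)

num_partitions = 3
partitions = {i: [] for i in range(num_partitions)}
for row in rows:
    date = row.split("|")[1]                      # the date column is field 1
    partitions[fn(date) % num_partitions].append(row)
# All rows with the same date are now in one partition, though two
# different dates may still collide into the same partition.
```

If you move to the DataFrame API, the whole split can also be done in one pass with the writer's own `partitionBy`, e.g. `df.write.partitionBy('date').csv(output_path)`, which produces one subdirectory per distinct date with no collision concern.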