
Split part files based on date column using pyspark

I have 200 CSV part files, separated by year from 2012 through 2018. I want to further split these files by the date column they contain, using pyspark. I would like to know an efficient way to do this, since each CSV contains millions of rows.

My current approach is to read all the CSV files for 2012 into a dataframe, and then, for each of the 365 days, filter that dataframe and write the matching rows out to a CSV for that date.

Is there a more efficient way to achieve this in pyspark?
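For reference, a minimal sketch of what this loop-based approach looks like (the paths, column names and dataframe calls below are assumptions for illustration, since the sample data has no header):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read all 2012 part files; the sample is pipe-delimited with no header,
# so the column names here are assumed
df = (spark.read
      .option("sep", "|")
      .csv("path/2012/*.csv")
      .toDF("id", "date", "col3", "col4", "col5"))

# Current approach: one filter and one write job per date, which is slow
for row in df.select("date").distinct().collect():
    dt = row["date"]
    df.filter(F.col("date") == dt).write.mode("overwrite").csv("output/" + dt)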

I have put sample data below:

> 1234|2012-01-01|abc|def|455 
> 
> 1278|2012-04-05|duuj|dea|457
> 
> 9998|2012-05-09|dimd|ase|759
> 
> 8892|2012-01-01|eedbnd|ss|378
> 
> 178|2012-04-05|dswuj|ada|47
> 
> 278|2012-04-05|d32j|d12a|421

I need this data to be written into 3 separate CSV files containing the data for 2012-01-01, 2012-04-05 and 2012-05-09.

There are 3 distinct dates in the sample data: 2012-01-01, 2012-04-05 and 2012-05-09. First, define a partitioning function that hashes the date:

def fn(dt):
    # Route each date string to a partition based on its hash
    return hash(dt)

Create a (key, value) pair RDD with the date as the key:

rdd = (sc.textFile('path/your_file.txt', 3)
         .map(lambda r: r.split('|'))
         .map(lambda r: (r[1], r)))

Pass the hash function to partitionBy so that rows are routed by the hash of their date key, then save the result:

rdd.partitionBy(3, fn).saveAsTextFile('partitioned_parts')

You should now see 3 part files, each containing the rows for one specific date (provided the three keys hash into different partitions).
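Note that saveAsTextFile on an RDD of (key, fields) tuples writes the Python repr of each tuple. If the output must keep the original pipe-delimited layout, a small variation of the same sketch (the output path name is again arbitrary) can drop the key and re-join the fields before saving; the map runs within the already-built partitions, so the split by date is preserved:

# Drop the key and rebuild each line in the original pipe-delimited format
(rdd.partitionBy(3, fn)
    .map(lambda kv: '|'.join(kv[1]))
    .saveAsTextFile('partitioned_parts_pipe'))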
