
Split part files based on date column using pyspark

I have 200 CSV part files, separated by year from 2012 through 2018. I want to further split these files by the date column they contain, using pyspark. I would like to know an efficient way to do this, since each CSV contains millions of rows.

My current approach is to read all the CSV files for 2012 into a dataframe, and then, for each of the 365 days, filter that dataframe and write the matching rows out to a CSV for that date.

Is there a more efficient way to achieve this in pyspark?
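For reference, a minimal sketch of what this loop-based approach looks like (the paths, column names and dataframe calls below are assumptions for illustration, since the sample data has no header):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read all 2012 part files; the sample is pipe-delimited with no header,
# so the column names here are assumed
df = (spark.read
      .option("sep", "|")
      .csv("path/2012/*.csv")
      .toDF("id", "date", "col3", "col4", "col5"))

# Current approach: one filter and one write job per date, which is slow
for row in df.select("date").distinct().collect():
    dt = row["date"]
    df.filter(F.col("date") == dt).write.mode("overwrite").csv("output/" + dt)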

I have put sample data below:

> 1234|2012-01-01|abc|def|455 
> 
> 1278|2012-04-05|duuj|dea|457
> 
> 9998|2012-05-09|dimd|ase|759
> 
> 8892|2012-01-01|eedbnd|ss|378
> 
> 178|2012-04-05|dswuj|ada|47
> 
> 278|2012-04-05|d32j|d12a|421

I need this data to be written into 3 separate CSV files containing the data for 2012-01-01, 2012-04-05 and 2012-05-09.

There are 3 distinct dates in the sample data: 2012-01-01, 2012-04-05 and 2012-05-09. First, define a partitioning function that hashes the date:

def fn(dt):
    # Route each date string to a partition based on its hash
    return hash(dt)

Create a (key, value) pair RDD with the date as the key:

rdd = (sc.textFile('path/your_file.txt', 3)
         .map(lambda r: r.split('|'))
         .map(lambda r: (r[1], r)))

Pass the hash function to partitionBy so that rows are routed by the hash of their date key, then save the result:

rdd.partitionBy(3, fn).saveAsTextFile('partitioned_parts')

You should now see 3 part files, each containing the rows for one specific date (provided the three keys hash into different partitions).
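Note that saveAsTextFile on an RDD of (key, fields) tuples writes the Python repr of each tuple. If the output must keep the original pipe-delimited layout, a small variation of the same sketch (the output path name is again arbitrary) can drop the key and re-join the fields before saving; the map runs within the already-built partitions, so the split by date is preserved:

# Drop the key and rebuild each line in the original pipe-delimited format
(rdd.partitionBy(3, fn)
    .map(lambda kv: '|'.join(kv[1]))
    .saveAsTextFile('partitioned_parts_pipe'))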
