I'm looking for the best solution for a problem that I have (-:
I have k CSV files (5 CSV files, for example); each file has m fields that form a key and n value columns. I need to produce a single CSV file with the aggregated data. For example:
file 1: f1,f2,f3,v1,v2,v3,v4
a1,b1,c1,50,60,70,80
a3,b2,c4,60,60,80,90
file 2: f1,f2,f3,v1,v2,v3,v4
a1,b1,c1,30,50,90,40
a3,b2,c4,30,70,50,90
result: f1,f2,f3,v1,v2,v3,v4
a1,b1,c1,80,110,160,120
a3,b2,c4,90,130,130,180
Algorithms we have considered so far:
hashing (using a ConcurrentHashMap; a sketch of this approach appears below)
merge sorting the files
DB: using MySQL or Hadoop.
The solution needs to be able to handle a huge amount of data (each file has more than two million rows).
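A minimal sketch of the hashing idea, assuming the number of distinct keys fits in memory and using the first example's layout (3 key columns, 4 value columns); the file names and column counts are placeholders, not part of the question. If the files were read in parallel, the HashMap would be swapped for a ConcurrentHashMap with per-key synchronization on the sums array.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// One pass over every file, accumulating the value columns per key in a map.
// KEY_COLS / VALUE_COLS and the input file names are placeholders for the example above.
public class CsvAggregator {
    static final int KEY_COLS = 3;   // f1,f2,f3
    static final int VALUE_COLS = 4; // v1,v2,v3,v4

    public static void main(String[] args) throws IOException {
        Map<String, long[]> totals = new HashMap<>();
        String[] inputs = {"file1.csv", "file2.csv"}; // hypothetical input names

        for (String input : inputs) {
            try (Stream<String> lines = Files.lines(Paths.get(input))) {
                lines.skip(1) // skip the header row
                     .forEach(line -> {
                         String[] cols = line.split(",");
                         // the first KEY_COLS columns form the key
                         String key = String.join(",", Arrays.copyOfRange(cols, 0, KEY_COLS));
                         long[] sums = totals.computeIfAbsent(key, k -> new long[VALUE_COLS]);
                         for (int i = 0; i < VALUE_COLS; i++) {
                             sums[i] += Long.parseLong(cols[KEY_COLS + i]);
                         }
                     });
            }
        }

        // write the merged file, header first
        StringBuilder out = new StringBuilder("f1,f2,f3,v1,v2,v3,v4\n");
        totals.forEach((key, sums) -> {
            out.append(key);
            for (long s : sums) out.append(',').append(s);
            out.append('\n');
        });
        Files.write(Paths.get("result.csv"), out.toString().getBytes());
    }
}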
A better example:
file 1:
country,city,peopleNum
england,london,1000000
england,coventry,500000
file 2:
country,city,peopleNum
england,london,500000
england,coventry,500000
england,manchester,500000
merged file:
country,city,peopleNum
england,london,1500000
england,coventry,1000000
england,manchester,500000
The key is country,city, of course... this is just an example. My real key has 6 columns and there are 8 data columns, for a total of 14 columns.
I think the answer really depends:
1) If you need a ready-made solution, then Splunk might be the way to go ( http://splunk-base.splunk.com/answers/6783/handling-large-amount-of-csv-files-as-input-and-rename-sourcetype-as-well-as-specify-header )
2) If you have the infrastructure / bandwidth / development time for Hadoop, then go create a solution (a MapReduce sketch follows this list)
3) If this is a one-time job, create a merge-sort solution (I have processed 2 TB files in bash using sed / awk / sort)
4) A custom solution if you don't like any of the above.
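For option 2, the aggregation maps naturally onto a MapReduce job: the mapper emits the key columns as the key and the value columns as the value, and the reducer sums per key. A rough sketch, assuming the 3-key / 4-value layout of the first example; the class names, header check, and paths are placeholders, not a drop-in solution.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvMergeJob {
    // Mapper: split each CSV line into key columns and value columns.
    public static class CsvMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split(",");
            if (cols[0].equals("f1")) return; // skip header rows (placeholder check)
            String key = cols[0] + "," + cols[1] + "," + cols[2];
            String values = cols[3] + "," + cols[4] + "," + cols[5] + "," + cols[6];
            ctx.write(new Text(key), new Text(values));
        }
    }

    // Reducer: sum the value columns of every record that shares a key.
    public static class SumReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> rows, Context ctx)
                throws IOException, InterruptedException {
            long[] sums = new long[4];
            for (Text row : rows) {
                String[] v = row.toString().split(",");
                for (int i = 0; i < 4; i++) sums[i] += Long.parseLong(v[i]);
            }
            ctx.write(key, new Text(sums[0] + "," + sums[1] + "," + sums[2] + "," + sums[3]));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv aggregate");
        job.setJarByClass(CsvMergeJob.class);
        job.setMapperClass(CsvMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory with the CSV files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the default TextOutputFormat separates key and value with a tab, so the output would need a separator change (or a small post-processing step) to be a plain comma-separated file.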