
How to read multiple CSV files in Spring Batch to merge the data for processing?

I'm new to Spring Batch and am looking for some guidance on the requirement below.

Overall Requirement:

I have to get data from different systems, apply some business logic, and save the result in a database.

Below is an example.

I need to read data from 3 CSV files:

  • First file – person.csv – contains the name and ID.
  • Second file – address.csv – contains address info for each person. One person can have zero or multiple addresses.
  • Third file – employment.csv – contains employment info for each person. One person can have zero or multiple employers.

Here is some sample data.

Person.csv (total size is 8 million)

"personID", "personName"

1, Joey

2, Chandler

3, Ross

4, Monica

Address.csv

"personID", "addressType", "state"

1, residence, NY

1, mailing, NC

2, residence, NY

4, residence, NY

4, mailing, DC

Employment.csv

"personID", "employerName"

1, emp1

2, emp2

2, emp3

3, emp4

Note: each file is sorted by person id.

To apply the business logic, I need to merge the data for each person, i.e., I need to merge the person, address, and employment data for one person before applying the logic. Can you suggest an approach for this?

It sounds like a 4-step job. You'll have to decide where the intermediate results of steps 1 to 3 should reside.

If the data from all the CSV files will fit in memory, then the intermediate results of steps 1 to 3 could just be a Map, with personID as the key. If not, then the intermediate results of steps 1 to 3 should probably be written to a temp table in the database.

Assuming all data will fit in memory, create a bean which can be injected into the ItemWriters of steps 1 to 3, for example:

// in a config class...
// assuming PersonID is of type Long
// Assuming Person class has appropriate attributes
Map<Long, Person> people = new HashMap<>();

Step 1:

  • ItemReader - reads the next Person.CSV row and creates a Person instance
  • ItemProcessor - nothing to do - pass the Person instance to the ItemWriter
  • ItemWriter - adds the Person instance to the people Map (or intermediate table).

Step 2:

  • ItemReader - reads the next Address.CSV row and creates an Address instance
  • ItemProcessor - nothing to do - pass the Address instance to the ItemWriter
  • ItemWriter - adds the Address to the related Person from the people Map (or intermediate table). TODO: what should happen if there is an Address for a person that does not exist?

Step 3:

  • ItemReader - reads the next Employment.CSV row and creates an Employment instance
  • ItemProcessor - nothing to do - pass the Employment instance to the ItemWriter
  • ItemWriter - adds the Employment to the related Person from the people Map (or intermediate table). TODO: what should happen if there is an Employment for a person that does not exist?
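The merge logic of the three writers above can be sketched in plain Java, without any Spring Batch types, so that it runs on its own. The names `MergedPerson`, `PeopleMerger`, and their methods are hypothetical, and the choice to set orphan rows aside in a list is only one possible answer to the TODOs above, not something the original prescribes. The caller is assumed to skip the header line of each file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical aggregate: one person plus all related address and employment rows.
class MergedPerson {
    final long personId;
    final String personName;
    final List<String[]> addresses = new ArrayList<>(); // each entry: {addressType, state}
    final List<String> employers = new ArrayList<>();   // employerName values

    MergedPerson(long personId, String personName) {
        this.personId = personId;
        this.personName = personName;
    }
}

// Hypothetical holder for the intermediate result of steps 1 to 3.
class PeopleMerger {
    // LinkedHashMap preserves insertion order, so step 4 sees people in file order.
    final Map<Long, MergedPerson> people = new LinkedHashMap<>();
    // Assumption: rows whose personID is unknown are set aside rather than dropped.
    final List<String> orphans = new ArrayList<>();

    // Splits a data row such as "1, residence, NY" into trimmed fields.
    // Naive split: does not handle quoted fields or embedded commas.
    private static String[] fields(String line) {
        String[] parts = line.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    // Step 1 writer: register the person under its personID.
    void addPerson(String line) {
        String[] f = fields(line);
        long id = Long.parseLong(f[0]);
        people.put(id, new MergedPerson(id, f[1]));
    }

    // Step 2 writer: attach the address to its person, or record an orphan.
    void addAddress(String line) {
        String[] f = fields(line);
        MergedPerson person = people.get(Long.parseLong(f[0]));
        if (person == null) {
            orphans.add(line);
            return;
        }
        person.addresses.add(new String[] {f[1], f[2]});
    }

    // Step 3 writer: attach the employer to its person, or record an orphan.
    void addEmployment(String line) {
        String[] f = fields(line);
        MergedPerson person = people.get(Long.parseLong(f[0]));
        if (person == null) {
            orphans.add(line);
            return;
        }
        person.employers.add(f[1]);
    }
}
```

In a real job, each `add...` method body would sit inside the corresponding ItemWriter, with the map injected as a shared bean.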

Since there is nothing for ItemProcessor to do in steps 1 to 3, it might be better to use a Tasklet.

Also, steps 1 to 3 could be done in parallel. It would probably increase performance, but there would be added complexity to ensure people is correctly populated.
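One way to handle that added complexity, sketched in plain Java under stated assumptions: use a thread-safe map and create a placeholder entry whenever an address or employment row arrives before its person row. The names `ConcurrentPerson` and `ConcurrentPeople` are hypothetical, and this is only one possible design, not the approach the answer itself specifies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical thread-safe aggregate for one person.
class ConcurrentPerson {
    final long personId;
    volatile String personName; // stays null until the person row has been seen
    final List<String> addresses = Collections.synchronizedList(new ArrayList<>());
    final List<String> employers = Collections.synchronizedList(new ArrayList<>());

    ConcurrentPerson(long personId) {
        this.personId = personId;
    }
}

// Hypothetical shared store that the three loading steps can update concurrently.
class ConcurrentPeople {
    final ConcurrentMap<Long, ConcurrentPerson> people = new ConcurrentHashMap<>();

    // Atomically returns the entry for this id, creating a placeholder if needed.
    private ConcurrentPerson entry(long id) {
        return people.computeIfAbsent(id, key -> new ConcurrentPerson(key));
    }

    void addPerson(long id, String name) {
        entry(id).personName = name;
    }

    void addAddress(long id, String address) {
        entry(id).addresses.add(address);
    }

    void addEmployer(long id, String employer) {
        entry(id).employers.add(employer);
    }

    // Call after all loaders have finished: ids that never received a person row.
    List<Long> idsWithoutPerson() {
        List<Long> ids = new ArrayList<>();
        for (ConcurrentPerson p : people.values()) {
            if (p.personName == null) {
                ids.add(p.personId);
            }
        }
        return ids;
    }
}
```

With this design the load order no longer matters, and orphan rows can only be detected once all three loaders have completed.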

Step 4:

  • ItemReader - reads the next element of people (or composite object from intermediate tables)
  • ItemProcessor - apply business logic
  • ItemWriter - write result to database
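The read, process, write cycle of step 4 can be illustrated with a small stand-alone loop. This is a simplified sketch of the chunk-oriented idea, not Spring Batch's actual implementation; `ChunkedStep`, `run`, and the chunk size parameter are hypothetical names, and the writer here is any callback that would, in a real job, persist a chunk of results to the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-in for a chunk-oriented step.
class ChunkedStep {
    // Reads items one by one, applies the processor, and hands results
    // to the writer in batches of at most chunkSize.
    static <I, O> void run(Iterable<I> reader,
                           Function<I, O> processor,
                           Consumer<List<O>> writer,
                           int chunkSize) {
        List<O> chunk = new ArrayList<>(chunkSize);
        for (I item : reader) {
            O result = processor.apply(item);
            // A null result skips the item, mirroring how a Spring Batch
            // ItemProcessor filters items by returning null.
            if (result != null) {
                chunk.add(result);
            }
            if (chunk.size() == chunkSize) {
                writer.accept(new ArrayList<>(chunk)); // pass a copy, then reuse the buffer
                chunk.clear();
            }
        }
        // Flush the final, possibly smaller, chunk.
        if (!chunk.isEmpty()) {
            writer.accept(new ArrayList<>(chunk));
        }
    }
}
```

For step 4, the reader would be `people.values()`, the processor would hold the business logic, and the writer would insert each chunk into the target table.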
