How to read multiple CSV files in Spring Batch to merge the data for processing?

I'm new to Spring Batch and am looking for guidance on the requirement below.

Overall requirement:

I have to get data from different systems, apply some business logic, and save the result in a database.

Below is an example.

I need to read data from 3 CSV files:

  • First file, person.csv: contains the name and ID of each person.
  • Second file, address.csv: contains address info for each person. One person can have zero or multiple addresses.
  • Third file, employment.csv: contains employment info for each person. One person can have zero or multiple employers.

Here is some sample data.

Person.csv (total size is 8 million)

"personID", "personName"
1, Joey
2, Chandler
3, Ross
4, Monica

Address.csv

"personID", "addressType", "state"
1, residence, NY
1, mailing, NC
2, residence, NY
4, residence, NY
4, mailing, DC

Employment.csv

"personID", "employerName"
1, emp1
2, emp2
2, emp3
3, emp4

Note: each file is sorted by person ID.

To apply the business logic, I need to merge the data for each person, i.e., combine the person, address, and employment records for one person before applying the logic. Can you suggest an approach for this?

It sounds like a 4-step job. You'll have to decide where the intermediate results of steps 1 to 3 should reside.

If the data from all the CSV files fits in memory, the intermediate results of steps 1 to 3 could just be a Map with personID as the key. If not, the intermediate results should probably be written to a temp table in the database.

Assuming all the data fits in memory, create a bean that can be injected into the ItemWriters of steps 1 to 3, for example:

// in a config class...
// requires: import java.util.HashMap; import java.util.Map;
// assuming personID is of type Long
// assuming the Person class has appropriate attributes
Map<Long, Person> people = new HashMap<>();
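
The Person class is left open above. As a minimal sketch of what "appropriate attributes" might look like, the classes below mirror the CSV headers from the question. All class and field names here (Address, Employment, Person, PeopleStore) are assumptions made for illustration, not something the question defines.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical domain classes; field names follow the CSV headers in the question.
class Address {
    final long personId;
    final String addressType;
    final String state;

    Address(long personId, String addressType, String state) {
        this.personId = personId;
        this.addressType = addressType;
        this.state = state;
    }
}

class Employment {
    final long personId;
    final String employerName;

    Employment(long personId, String employerName) {
        this.personId = personId;
        this.employerName = employerName;
    }
}

class Person {
    final long personId;
    final String personName;
    // Zero or more entries each, to be filled in by steps 2 and 3.
    final List<Address> addresses = new ArrayList<>();
    final List<Employment> employments = new ArrayList<>();

    Person(long personId, String personName) {
        this.personId = personId;
        this.personName = personName;
    }
}

// Hypothetical holder for the shared map that all four steps would receive.
class PeopleStore {
    final Map<Long, Person> people = new HashMap<>();
}
```

With 8 million persons, whether this object graph fits in memory depends on the heap size, so it is worth measuring before committing to the in-memory option.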

Step 1:

  • ItemReader - reads the next Person.csv row and creates a Person instance
  • ItemProcessor - nothing to do; passes the Person instance to the ItemWriter
  • ItemWriter - adds the Person instance to the people Map (or intermediate table)
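
The core of step 1 can be sketched in plain Java, without the Spring Batch types, so that it runs on its own. PersonMapWriter and its parse helper are hypothetical names; in a real job the row mapping would normally be done by a reader such as FlatFileItemReader, and the write logic would sit inside an ItemWriter implementation. The naive split on commas is an assumption that holds for the sample data only (no quoted commas).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Person {
    final long personId;
    final String personName;

    Person(long personId, String personName) {
        this.personId = personId;
        this.personName = personName;
    }
}

// Stand-in for the step 1 ItemWriter: receives one chunk of items and stores them.
class PersonMapWriter {
    private final Map<Long, Person> people;

    PersonMapWriter(Map<Long, Person> people) {
        this.people = people;
    }

    // Stand-in for the reader's row mapping: "1, Joey" becomes Person(1, "Joey").
    static Person parse(String csvRow) {
        String[] fields = csvRow.split(",", -1);
        return new Person(Long.parseLong(fields[0].trim()), fields[1].trim());
    }

    // Called once per chunk; keys each person by personID.
    void write(List<Person> chunk) {
        for (Person p : chunk) {
            people.put(p.personId, p);
        }
    }
}
```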

Step 2:

  • ItemReader - reads the next Address.csv row and creates an Address instance
  • ItemProcessor - nothing to do; passes the Address instance to the ItemWriter
  • ItemWriter - looks up the related Person in the people Map (or intermediate table) and adds the Address to it. TODO: what should happen if there is an Address for a person that does not exist?
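
One possible answer to the TODO in step 2 is to set unmatched addresses aside instead of failing the step. The sketch below assumes that policy; it is not the only reasonable one (skipping silently or aborting the job are alternatives). AddressAttachingWriter and its orphans list are hypothetical names, and the classes are reduced to what the example needs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Address {
    final long personId;
    final String addressType;
    final String state;

    Address(long personId, String addressType, String state) {
        this.personId = personId;
        this.addressType = addressType;
        this.state = state;
    }
}

class Person {
    final long personId;
    final String personName;
    final List<Address> addresses = new ArrayList<>();

    Person(long personId, String personName) {
        this.personId = personId;
        this.personName = personName;
    }
}

// Stand-in for the step 2 ItemWriter.
class AddressAttachingWriter {
    private final Map<Long, Person> people;
    // Addresses whose personID was not loaded in step 1, kept for later reporting.
    final List<Address> orphans = new ArrayList<>();

    AddressAttachingWriter(Map<Long, Person> people) {
        this.people = people;
    }

    void write(List<Address> chunk) {
        for (Address a : chunk) {
            Person owner = people.get(a.personId);
            if (owner == null) {
                orphans.add(a); // assumed policy: set aside, do not fail the step
            } else {
                owner.addresses.add(a);
            }
        }
    }
}
```

The same shape would apply to step 3, with Employment in place of Address.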

Step 3:

  • ItemReader - reads the next Employment.csv row and creates an Employment instance
  • ItemProcessor - nothing to do; passes the Employment instance to the ItemWriter
  • ItemWriter - looks up the related Person in the people Map (or intermediate table) and adds the Employment to it. TODO: what should happen if there is an Employment for a person that does not exist?

Since there is nothing for the ItemProcessor to do in steps 1 to 3, it might be better to use a Tasklet.
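
A Tasklet would do the read and the store in a single pass rather than in reader/processor/writer stages. The plain-Java sketch below shows only that loop body, under the assumptions that the first line is a header and that no field contains a quoted comma. PersonFileLoader is a hypothetical name, and the map holds names instead of Person objects to keep the example short; in Spring Batch this logic would go inside a Tasklet's execute method.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Map;

class PersonFileLoader {
    // Reads the whole source in one pass and returns the number of rows loaded.
    static int load(Reader source, Map<Long, String> namesById) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line = reader.readLine(); // skip the header row
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // tolerate blank lines
                }
                String[] fields = line.split(",", 2);
                namesById.put(Long.parseLong(fields[0].trim()), fields[1].trim());
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

The trade-off is that a single-pass load gives up the per-chunk commit, skip, and restart handling that a chunk-oriented step provides, which may matter for a file with 8 million rows.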

Also, steps 1 to 3 could be run in parallel. That would probably improve performance, but it adds complexity, because the people Map must still be populated correctly when an address or employment row arrives before its person row.
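
To illustrate the added complexity of the parallel variant: once the three loads run concurrently, the shared map must be thread-safe and an address may arrive before its person. The sketch below is one way to handle that, using ConcurrentHashMap.computeIfAbsent to create a placeholder on first sight of an ID. PersonRecord and ParallelMerge are hypothetical names, the data is a small subset of the sample, and plain threads stand in for whatever parallel-step mechanism the job would actually use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PersonRecord {
    final long personId;
    volatile String personName; // stays null if no person row ever arrives
    final List<String> addressStates = Collections.synchronizedList(new ArrayList<>());
    final List<String> employers = Collections.synchronizedList(new ArrayList<>());

    PersonRecord(long personId) {
        this.personId = personId;
    }
}

class ParallelMerge {
    final Map<Long, PersonRecord> people = new ConcurrentHashMap<>();

    // Creates a placeholder atomically if this ID has not been seen yet.
    private PersonRecord entry(long id) {
        return people.computeIfAbsent(id, PersonRecord::new);
    }

    void addPerson(long id, String name) { entry(id).personName = name; }
    void addAddress(long id, String state) { entry(id).addressStates.add(state); }
    void addEmployer(long id, String employer) { entry(id).employers.add(employer); }

    // Runs the three loads concurrently and waits for all of them to finish.
    static ParallelMerge runSample() {
        ParallelMerge merge = new ParallelMerge();
        ExecutorService pool = Executors.newFixedThreadPool(3);
        pool.execute(() -> { merge.addPerson(1L, "Joey"); merge.addPerson(3L, "Ross"); });
        pool.execute(() -> { merge.addAddress(1L, "NY"); merge.addAddress(1L, "NC"); });
        pool.execute(() -> { merge.addEmployer(1L, "emp1"); merge.addEmployer(3L, "emp4"); });
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("loads did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return merge;
    }
}
```

After all three loads finish, any record whose personName is still null corresponds to an address or employer for a person that does not exist, so the orphan question from steps 2 and 3 still has to be answered, only later.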

Step 4:

  • ItemReader - reads the next element of people (or a composite object from the intermediate tables)
  • ItemProcessor - applies the business logic
  • ItemWriter - writes the result to the database
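
Step 4 can be sketched as a loop over the merged map. The business logic is not described in the question, so the process method below is a placeholder that only counts addresses and employers; MergedPerson, Step4Sketch, and the sink list (standing in for the database) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergedPerson {
    final long personId;
    final String personName;
    final List<String> addressStates = new ArrayList<>();
    final List<String> employers = new ArrayList<>();

    MergedPerson(long personId, String personName) {
        this.personId = personId;
        this.personName = personName;
    }
}

class Step4Sketch {
    // Placeholder for the real business logic, which the question does not describe.
    static String process(MergedPerson p) {
        return p.personId + "," + p.personName + ","
                + p.addressStates.size() + "," + p.employers.size();
    }

    // Reader, processor, and writer collapsed into one loop; 'sink' stands in for the database.
    static void run(Map<Long, MergedPerson> people, List<String> sink) {
        for (MergedPerson p : people.values()) {
            sink.add(process(p));
        }
    }
}
```

A HashMap does not iterate in any particular order, so if the output must follow personID order, a TreeMap or LinkedHashMap would be the safer choice for people.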
