Using Pentaho Kettle, how do I load multiple tables from a single table while keeping referential integrity?
Need to load data from a single file with 100,000+ records into multiple tables on MySQL, maintaining the relationships defined in the file/tables; meaning the relationships already match. The solution should work on the latest version of MySQL and needs to use the InnoDB engine; MyISAM does not support foreign keys.
I am completely new to using Pentaho Data Integration (aka Kettle) and any pointers would be appreciated.
I might add that it is a requirement that the foreign key constraints are NOT disabled. It is my understanding that MySQL does not re-verify referential integrity when foreign key checks are turned back on, so any bad rows inserted while the checks were off go undetected. SOURCE: 5.1.4. Server System Variables -- foreign_key_checks
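To illustrate that pitfall, here is a quick sketch. It uses Python's built-in SQLite rather than MySQL (the `foreign_key_checks` behavior I'm describing is analogous to SQLite's `PRAGMA foreign_keys`); the `parent`/`child` tables are made up for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = OFF")            # checks disabled
con.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE child (pid INTEGER REFERENCES parent(id))")
con.execute("INSERT INTO child (pid) VALUES (99)")  # orphan row slips in unchecked
con.commit()

con.execute("PRAGMA foreign_keys = ON")             # re-enable checks
# The existing orphan is NOT re-validated; only future writes are checked.
orphans = con.execute("SELECT pid FROM child").fetchall()  # [(99,)]
```

So once a bad row gets in while checks are off, nothing flags it afterwards -- which is exactly why I don't want the constraints disabled during the load.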
All approaches should include some form of validation and a rollback strategy should an insert fail, or fail to maintain referential integrity.
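By rollback strategy I mean something like the following sketch: the whole batch runs in one transaction, so a failed insert undoes everything, including rows that had already succeeded. (Plain Python with SQLite standing in for MySQL; the tables are a cut-down version of the schema below.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Employee2Office (
  employee_id INTEGER NOT NULL REFERENCES Employee(id),
  office_id   INTEGER NOT NULL
);
""")

try:
    with con:  # one transaction; rolls back automatically on exception
        con.execute("INSERT INTO Employee (id, name) VALUES (1, 'John Smith')")
        # No Employee with id 42 exists -> foreign key violation
        con.execute("INSERT INTO Employee2Office VALUES (42, 1)")
except sqlite3.IntegrityError:
    pass  # whole batch rolled back, including the valid Employee row

count = con.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]  # 0
```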
Again, I'm completely new to this and doing my best to provide as much information as possible. If you have any questions or need clarification -- just let me know.
If you are able to post the XML from the kjb and ktr files (jobs/transformations), that would be SUPER. I might even hunt down every comment/answer you've ever made anywhere and upvote them... :-) ...really, it's really important to me to find an answer for this.
Thanks!
SAMPLE DATA: To better elaborate with an example, let's assume I am trying to load a tab-separated file containing employee names, the offices they have occupied in the past, and their job title history.
File:
EmployeeName<tab>OfficeHistory<tab>JobLevelHistory
John Smith<tab>501<tab>Engineer
John Smith<tab>601<tab>Senior Engineer
John Smith<tab>701<tab>Manager
Alex Button<tab>601<tab>Senior Assistant
Alex Button<tab>454<tab>Manager
NOTE: The single-table database is completely normalized (as much as a single table can be) -- for example, in the case of "John Smith" there is only one John Smith, meaning there are no duplicates that would lead to conflicts in referential integrity.
The MyOffice database schema has the following tables:
Employee (nId, name)
Office (nId, number)
JobTitle (nId, titleName)
Employee2Office (nEmpID, nOfficeId)
Employee2JobTitle (nEmpId, nJobTitleID)
So in this case, the tables should look like:
Employee
1 John Smith
2 Alex Button
Office
1 501
2 601
3 701
4 454
JobTitle
1 Engineer
2 Senior Engineer
3 Manager
4 Senior Assistant
Employee2Office
1 1
1 2
1 3
2 2
2 4
Employee2JobTitle
1 1
1 2
1 3
2 4
2 3
Here's the MySQL DDL to create the database and tables:
create database MyOffice2;
use MyOffice2;
CREATE TABLE Employee (
id MEDIUMINT NOT NULL AUTO_INCREMENT,
name CHAR(50) NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
CREATE TABLE Office (
id MEDIUMINT NOT NULL AUTO_INCREMENT,
office_number INT NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
CREATE TABLE JobTitle (
id MEDIUMINT NOT NULL AUTO_INCREMENT,
title CHAR(30) NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
CREATE TABLE Employee2JobTitle (
employee_id MEDIUMINT NOT NULL,
job_title_id MEDIUMINT NOT NULL,
FOREIGN KEY (employee_id) REFERENCES Employee(id),
FOREIGN KEY (job_title_id) REFERENCES JobTitle(id),
PRIMARY KEY (employee_id, job_title_id)
) ENGINE=InnoDB;
CREATE TABLE Employee2Office (
employee_id MEDIUMINT NOT NULL,
office_id MEDIUMINT NOT NULL,
FOREIGN KEY (employee_id) REFERENCES Employee(id),
FOREIGN KEY (office_id) REFERENCES Office(id),
PRIMARY KEY (employee_id, office_id)
) ENGINE=InnoDB;
PREP:
(a) Using the sample data, create a CSV by changing <TAB> to comma-delimited.
Dataflow by Step: (My Notes)
I put together a sample transformation (right click and choose save link) based on what you provided. The only step I feel a bit uncertain about is the last table inputs. I'm basically writing the join data to the table and letting it fail if a specific relationship already exists.
This solution doesn't really meet the "All approaches should include some form of validation and a rollback strategy should an insert fail, or fail to maintain referential integrity" criteria, though it probably won't fail. If you really want to set up something more complex we can, but this should definitely get you going with these transformations.
1. We start with reading in your file. In my case I converted it to CSV, but tab-delimited is fine too.
2. Now we're going to insert the employee names into the Employee table using a Combination lookup/update step. After the insert we append the employee_id to our datastream as id and remove EmployeeName from the data stream.
3. Here we're just using a Select Values step to rename the id field to employee_id.
4. Insert job titles just like we did employees, appending the title id to our datastream and deleting JobLevelHistory from the datastream.
5. Simple rename of the title id to title_id (see step 3).
6. Insert offices, get ids, remove OfficeHistory from the stream.
7. Simple rename of the office id to office_id (see step 3).
8. Copy the data from the last step into two streams with the fields employee_id, office_id and employee_id, title_id respectively.
9. Use a Table Output step to insert the join data. I've set it to ignore insert errors, as there could be duplicates and the PK constraints will make some rows fail.
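For reference, the dataflow above can be sketched outside Kettle in a few lines of plain Python (SQLite standing in for MySQL). The `lookup_or_insert` helper is my own illustration of what the Combination lookup/update step does, and `INSERT OR IGNORE` plays the role of the "ignore insert errors" option in step 9; the whole load runs in one transaction so a failure rolls everything back:

```python
import sqlite3

rows = [
    ("John Smith", "501", "Engineer"),
    ("John Smith", "601", "Senior Engineer"),
    ("John Smith", "701", "Manager"),
    ("Alex Button", "601", "Senior Assistant"),
    ("Alex Button", "454", "Manager"),
]

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE Office   (id INTEGER PRIMARY KEY, office_number INTEGER NOT NULL UNIQUE);
CREATE TABLE JobTitle (id INTEGER PRIMARY KEY, title TEXT NOT NULL UNIQUE);
CREATE TABLE Employee2Office (
  employee_id INTEGER NOT NULL REFERENCES Employee(id),
  office_id   INTEGER NOT NULL REFERENCES Office(id),
  PRIMARY KEY (employee_id, office_id)
);
CREATE TABLE Employee2JobTitle (
  employee_id  INTEGER NOT NULL REFERENCES Employee(id),
  job_title_id INTEGER NOT NULL REFERENCES JobTitle(id),
  PRIMARY KEY (employee_id, job_title_id)
);
""")

def lookup_or_insert(cur, table, col, value):
    # Combination lookup/update pattern: return the surrogate key,
    # inserting the dimension row first if it doesn't exist yet.
    row = cur.execute(f"SELECT id FROM {table} WHERE {col} = ?", (value,)).fetchone()
    if row:
        return row[0]
    cur.execute(f"INSERT INTO {table} ({col}) VALUES (?)", (value,))
    return cur.lastrowid

with con:  # one transaction: any failure rolls the whole load back
    cur = con.cursor()
    for name, office, title in rows:
        emp_id    = lookup_or_insert(cur, "Employee", "name", name)
        office_id = lookup_or_insert(cur, "Office", "office_number", int(office))
        title_id  = lookup_or_insert(cur, "JobTitle", "title", title)
        # OR IGNORE skips duplicate pairs instead of failing on the PK,
        # like the "ignore insert errors" option on Table Output
        cur.execute("INSERT OR IGNORE INTO Employee2Office VALUES (?, ?)",
                    (emp_id, office_id))
        cur.execute("INSERT OR IGNORE INTO Employee2JobTitle VALUES (?, ?)",
                    (emp_id, title_id))
```

Running this against the sample data yields 2 employees, 4 offices, 4 job titles, and 5 rows in each join table, matching the expected tables above.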