Using Pentaho Kettle, how do I load multiple tables from a single table while keeping referential integrity?
Need to load data from a single file with 100,000+ records into multiple tables on MySQL, maintaining the relationships defined in the file/tables; meaning the relationships already match. The solution should work on the latest version of MySQL and needs to use the InnoDB engine; MyISAM does not support foreign keys.
I am completely new to using Pentaho Data Integration (aka Kettle) and any pointers would be appreciated.
I might add that it is a requirement that the foreign key constraints are NOT disabled. It is my understanding that MySQL does not re-verify referential integrity when foreign key checks are turned back on, so any bad rows inserted while the checks were off go undetected. SOURCE: 5.1.4. Server System Variables -- foreign_key_checks
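To illustrate that pitfall, here is a quick sketch. It uses Python's built-in SQLite rather than MySQL (the `foreign_key_checks` behavior I'm describing is analogous to SQLite's `PRAGMA foreign_keys`); the `parent`/`child` tables are made up for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = OFF")            # checks disabled
con.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE child (pid INTEGER REFERENCES parent(id))")
con.execute("INSERT INTO child (pid) VALUES (99)")  # orphan row slips in unchecked
con.commit()

con.execute("PRAGMA foreign_keys = ON")             # re-enable checks
# The existing orphan is NOT re-validated; only future writes are checked.
orphans = con.execute("SELECT pid FROM child").fetchall()  # [(99,)]
```

So once a bad row gets in while checks are off, nothing flags it afterwards -- which is exactly why I don't want the constraints disabled during the load.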
All approaches should include some form of validation and a rollback strategy should an insert fail, or fail to maintain referential integrity.
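By rollback strategy I mean something like the following sketch: the whole batch runs in one transaction, so a failed insert undoes everything, including rows that had already succeeded. (Plain Python with SQLite standing in for MySQL; the tables are a cut-down version of the schema below.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Employee2Office (
  employee_id INTEGER NOT NULL REFERENCES Employee(id),
  office_id   INTEGER NOT NULL
);
""")

try:
    with con:  # one transaction; rolls back automatically on exception
        con.execute("INSERT INTO Employee (id, name) VALUES (1, 'John Smith')")
        # No Employee with id 42 exists -> foreign key violation
        con.execute("INSERT INTO Employee2Office VALUES (42, 1)")
except sqlite3.IntegrityError:
    pass  # whole batch rolled back, including the valid Employee row

count = con.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]  # 0
```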
Again, I'm completely new to this and doing my best to provide as much information as possible. If you have any questions or need clarification -- just let me know.
If you are able to post the XML from the kjb and ktr files (jobs/transformations), that would be SUPER. I might even hunt down every comment/answer you've ever made anywhere and upvote them... :-) ...really, it's really important to me to find an answer for this.
Thanks!
SAMPLE DATA: To better elaborate with an example, let's assume I am trying to load a tab-separated file containing employee names, the offices they have occupied in the past, and their job title history.
File:
EmployeeName<tab>OfficeHistory<tab>JobLevelHistory
John Smith<tab>501<tab>Engineer
John Smith<tab>601<tab>Senior Engineer
John Smith<tab>701<tab>Manager
Alex Button<tab>601<tab>Senior Assistant
Alex Button<tab>454<tab>Manager
NOTE: The single-table database is completely normalized (as much as a single table can be) -- for example, in the case of "John Smith" there is only one John Smith, meaning there are no duplicates that would lead to conflicts in referential integrity.
The MyOffice database schema has the following tables:
Employee (nId, name)
Office (nId, number)
JobTitle (nId, titleName)
Employee2Office (nEmpID, nOfficeId)
Employee2JobTitle (nEmpId, nJobTitleID)
So in this case, the tables should look like:
Employee
1 John Smith
2 Alex Button
Office
1 501
2 601
3 701
4 454
JobTitle
1 Engineer
2 Senior Engineer
3 Manager
4 Senior Assistant
Employee2Office
1 1
1 2
1 3
2 2
2 4
Employee2JobTitle
1 1
1 2
1 3
2 4
2 3
Here's the MySQL DDL to create the database and tables:
create database MyOffice2;
use MyOffice2;
CREATE TABLE Employee (
id MEDIUMINT NOT NULL AUTO_INCREMENT,
name CHAR(50) NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
CREATE TABLE Office (
id MEDIUMINT NOT NULL AUTO_INCREMENT,
office_number INT NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
CREATE TABLE JobTitle (
id MEDIUMINT NOT NULL AUTO_INCREMENT,
title CHAR(30) NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
CREATE TABLE Employee2JobTitle (
employee_id MEDIUMINT NOT NULL,
job_title_id MEDIUMINT NOT NULL,
FOREIGN KEY (employee_id) REFERENCES Employee(id),
FOREIGN KEY (job_title_id) REFERENCES JobTitle(id),
PRIMARY KEY (employee_id, job_title_id)
) ENGINE=InnoDB;
CREATE TABLE Employee2Office (
employee_id MEDIUMINT NOT NULL,
office_id MEDIUMINT NOT NULL,
FOREIGN KEY (employee_id) REFERENCES Employee(id),
FOREIGN KEY (office_id) REFERENCES Office(id),
PRIMARY KEY (employee_id, office_id)
) ENGINE=InnoDB;
PREP:
(a) Using the sample data, create a CSV by changing <TAB> to comma-delimited.
Dataflow by Step: (My Notes)
I put together a sample transformation (right click and choose save link) based on what you provided. The only step I feel a bit uncertain about is the last table inputs. I'm basically writing the join data to the table and letting it fail if a specific relationship already exists.
This solution doesn't really meet the "All approaches should include some form of validation and a rollback strategy should an insert fail, or fail to maintain referential integrity" criteria, though it probably won't fail. If you really want to set up something more complex we can, but this should definitely get you going with these transformations.
1. We start with reading in your file. In my case I converted it to CSV, but tab-delimited is fine too.
2. Now we're going to insert the employee names into the Employee table using a Combination lookup/update step. After the insert we append the employee_id to our datastream as id and remove EmployeeName from the data stream.
3. Here we're just using a Select Values step to rename the id field to employee_id.
4. Insert job titles just like we did employees, appending the title id to our datastream and deleting JobLevelHistory from the datastream.
5. Simple rename of the title id to title_id (see step 3).
6. Insert offices, get ids, remove OfficeHistory from the stream.
7. Simple rename of the office id to office_id (see step 3).
8. Copy the data from the last step into two streams with the fields employee_id, office_id and employee_id, title_id respectively.
9. Use a Table Output step to insert the join data. I've set it to ignore insert errors, as there could be duplicates and the PK constraints will make some rows fail.
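For reference, the dataflow above can be sketched outside Kettle in a few lines of plain Python (SQLite standing in for MySQL). The `lookup_or_insert` helper is my own illustration of what the Combination lookup/update step does, and `INSERT OR IGNORE` plays the role of the "ignore insert errors" option in step 9; the whole load runs in one transaction so a failure rolls everything back:

```python
import sqlite3

rows = [
    ("John Smith", "501", "Engineer"),
    ("John Smith", "601", "Senior Engineer"),
    ("John Smith", "701", "Manager"),
    ("Alex Button", "601", "Senior Assistant"),
    ("Alex Button", "454", "Manager"),
]

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE Office   (id INTEGER PRIMARY KEY, office_number INTEGER NOT NULL UNIQUE);
CREATE TABLE JobTitle (id INTEGER PRIMARY KEY, title TEXT NOT NULL UNIQUE);
CREATE TABLE Employee2Office (
  employee_id INTEGER NOT NULL REFERENCES Employee(id),
  office_id   INTEGER NOT NULL REFERENCES Office(id),
  PRIMARY KEY (employee_id, office_id)
);
CREATE TABLE Employee2JobTitle (
  employee_id  INTEGER NOT NULL REFERENCES Employee(id),
  job_title_id INTEGER NOT NULL REFERENCES JobTitle(id),
  PRIMARY KEY (employee_id, job_title_id)
);
""")

def lookup_or_insert(cur, table, col, value):
    # Combination lookup/update pattern: return the surrogate key,
    # inserting the dimension row first if it doesn't exist yet.
    row = cur.execute(f"SELECT id FROM {table} WHERE {col} = ?", (value,)).fetchone()
    if row:
        return row[0]
    cur.execute(f"INSERT INTO {table} ({col}) VALUES (?)", (value,))
    return cur.lastrowid

with con:  # one transaction: any failure rolls the whole load back
    cur = con.cursor()
    for name, office, title in rows:
        emp_id    = lookup_or_insert(cur, "Employee", "name", name)
        office_id = lookup_or_insert(cur, "Office", "office_number", int(office))
        title_id  = lookup_or_insert(cur, "JobTitle", "title", title)
        # OR IGNORE skips duplicate pairs instead of failing on the PK,
        # like the "ignore insert errors" option on Table Output
        cur.execute("INSERT OR IGNORE INTO Employee2Office VALUES (?, ?)",
                    (emp_id, office_id))
        cur.execute("INSERT OR IGNORE INTO Employee2JobTitle VALUES (?, ?)",
                    (emp_id, title_id))
```

Running this against the sample data yields 2 employees, 4 offices, 4 job titles, and 5 rows in each join table, matching the expected tables above.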